Samsung's F2FS filesystem
Posted Oct 7, 2012 0:17 UTC (Sun) by travelsn (guest, #48694)
Parent article: Samsung's F2FS filesystem
It would be nice to see some data on how this file system compares to JFFS2 or UBIFS.
Posted Oct 8, 2012 9:12 UTC (Mon)
by Felix.Braun (guest, #3032)
[Link]
Posted Oct 8, 2012 12:43 UTC (Mon)
by gb (subscriber, #58328)
[Link] (16 responses)
jffs2 and ubifs are for devices without an FTL; f2fs is for consumer devices with an FTL, so it makes little sense to compare them... Of course it would be nice to compare flash with an FTL against flash without one, but it is not realistic that someone would do such a comparison.
Posted Oct 8, 2012 18:41 UTC (Mon)
by jpfrancois (subscriber, #65948)
[Link] (14 responses)
Maybe there are some well-known limitations to the FTL found in the typical SD card, but I don't know them, so it is not clear to me how F2FS manages to be NAND-friendly when the NAND characteristics are precisely what the SD card's embedded controller carefully hides.
Posted Oct 8, 2012 19:52 UTC (Mon)
by khim (subscriber, #9252)
[Link] (1 responses)
Yes, the FTL is supposed to transparently hide the nature of the flash underneath it. But this layer is extremely leaky, so it makes perfect sense to design a filesystem which makes it happier. Well, it only makes sense in a world where devices with FTLs that can't be turned off exist, but, for better or for worse, that is exactly the world we have. It's impossible to hide them. The fact that flash erase blocks are much larger than write blocks is a central feature of flash and it can never be fully hidden. Any somewhat sane FTL will handle sequential writes just fine and random writes worse, to the degree that its speed will drop by a factor of 10 or 100 beyond a certain threshold. Random reads are fine. These two fundamental, unfixable characteristics are enough to design a filesystem around.
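As an illustration of those two characteristics, here is a toy FTL model in Python. It is nothing like real controller firmware, and the 2 MiB erase block, 4 KiB write size and four-open-block limit are invented round numbers, but it shows why the same amount of host data costs far more flash programming when it arrives as scattered writes:

# Toy model of a simplistic FTL: one erase block can be appended to cheaply,
# anything else forces a merge of a whole erase block. Purely illustrative.
import random
from collections import OrderedDict

random.seed(0)
PAGE = 4 * 1024                   # host write size
ERASE_BLOCK = 2 * 1024 * 1024     # flash erase block size
PAGES_PER_EB = ERASE_BLOCK // PAGE
OPEN_LIMIT = 4                    # erase blocks the FTL can append to at once

def flash_bytes_programmed(page_numbers):
    """Bytes of flash actually programmed for a sequence of 4 KiB host writes.
    An open erase block accepts cheap appends; when it is evicted, the FTL
    must copy in whatever part of the block the host did not rewrite."""
    open_blocks = OrderedDict()   # erase block -> pages appended while open
    programmed = 0
    for page in page_numbers:
        eb = page // PAGES_PER_EB
        if eb in open_blocks:
            open_blocks.move_to_end(eb)
        else:
            if len(open_blocks) >= OPEN_LIMIT:
                _, appended = open_blocks.popitem(last=False)       # evict LRU
                programmed += max(0, PAGES_PER_EB - appended) * PAGE  # merge copy
            open_blocks[eb] = 0
        open_blocks[eb] += 1
        programmed += PAGE
    return programmed

n = 4096                                   # 16 MiB of host writes
seq = list(range(n))
rnd = random.sample(range(n * 16), n)      # scattered across a 256 MiB area
host_bytes = n * PAGE
print("sequential amplification:", flash_bytes_programmed(seq) / host_bytes)  # ~1.0
print("random amplification:    ", flash_bytes_programmed(rnd) / host_bytes)  # hundreds

Real devices behave better than this toy, but the 10x or 100x slowdowns come from the same mechanism: every scattered write drags a nearly full erase block along with it.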
Posted Oct 10, 2012 9:05 UTC (Wed)
by dwmw2 (subscriber, #2063)
[Link]
The only really fundamental thing you can know is that you need some kind of TRIM mechanism to tell it that you no longer care about the contents of certain sectors, otherwise it'll faithfully copy that obsolete data around the flash as it garbage collects, causing lots of lovely write amplification and shortening the lifetime of the flash. It's a shame that TRIM is such a performance problem in many current implementations...
What you actually want to do is group long-term stable data together into the same flash eraseblocks, and short-term volatile data together into other eraseblocks. That way, when we need to do garbage collection to reclaim dirty flash space, we can make maximum progress for the least amount of copying. But you basically have no idea how to do that on an FTL which is a black box.
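As a tiny illustration of that grouping argument (with invented numbers, not taken from any real filesystem): a greedy collector reclaims a block by copying out whatever is still live, so the emptier the victim, the cheaper the reclaim:

# Minimal sketch of why grouping data by lifetime helps garbage collection.
# "valid" counts pages whose contents are still live in each 64-page block.

def pages_copied_to_reclaim_one_block(valid_counts):
    """Greedy GC: pick the erase block with the least valid data and copy
    only that data out before erasing the block."""
    return min(valid_counts)

# Hot (frequently rewritten) and cold (stable) data mixed in every block:
# rewrites leave every block roughly half live, so whichever victim we pick,
# 32 of its 64 pages still have to be copied somewhere else first.
mixed = [32, 32, 32, 32, 32, 32, 32, 32]

# Hot and cold data segregated: blocks that held only hot data are nearly
# empty once their contents are rewritten elsewhere, while the cold blocks
# stay full and never need collecting at all.
segregated = [2, 2, 2, 2, 64, 64, 64, 64]

print(pages_copied_to_reclaim_one_block(mixed))       # 32
print(pages_copied_to_reclaim_one_block(segregated))  # 2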
Posted Oct 9, 2012 15:45 UTC (Tue)
by dwmw2 (subscriber, #2063)
[Link] (11 responses)
So while in theory you can treat it as "just a disk", in practice that's almost never the case. Even in theory that fantasy doesn't hold true once you start paying attention — even a decent high-end SSD needs layering violations like TRIM in order to maintain efficiency and stop it from garbage collecting (and thus suffering write amplification) on data it doesn't even need any more.
So Samsung have come up with a "disk" file system which is optimised to work around the failings of the eMMC devices that they are using. Which is all very well for Samsung, who make their own MMC devices too. But the problem for everyone else is that you don't actually know about the internals of one of these devices. It's just a black box to you. You could do extensive testing and find one of the rare ones that does actually survive extended powerfail testing, and that works efficiently with this type of software workaround — but manufacturers have a history of changing the internals dramatically without even changing the model number, so the next batch you order could be completely different. Instead of aligning the allocation units of your "disk" file system to carefully line up with NAND eraseblocks, due to your internal knowledge of the way the internal FTL happens to lay stuff out, you could suddenly find that your "disk" file system is laying stuff out precisely wrong for the lower layer instead.
Any optimisation you attempt at this level is a layering violation and is doomed from the beginning unless you can control the internals of the MMC device too. Which you can't. Although Samsung can. So that's nice for them.
Really, people who are serious about embedded Linux should be driving the NAND directly rather than accepting the MMC "pretend to be spinning rust" approach, with all the disadvantages and unreliability that it brings.
Posted Oct 10, 2012 8:23 UTC (Wed)
by jpfrancois (subscriber, #65948)
[Link] (10 responses)
Back to F2FS, what you are saying is that:
Posted Oct 10, 2012 8:47 UTC (Wed)
by dwmw2 (subscriber, #2063)
[Link] (3 responses)
Posted Oct 10, 2012 13:48 UTC (Wed)
by arnd (subscriber, #8866)
[Link] (2 responses)
1. A device that remaps physical erase blocks to virtual erase blocks of the same size and can write into a limited number of them. This covers 99% of the SD cards and USB sticks, as well as the majority of the existing eMMC devices. f2fs should work great (i.e. much better than any other fs we have available) on these devices, as long as the number of open erase blocks in the fs (at most 6) does not exceed the number of blocks the device can handle (vendor specific, normally somewhere between 1 and 20) and the erase block size is known (see the alignment sketch below).
2. Upcoming devices that have a very simplistic log-structured file system on them. This covers one SD card I've seen (Samsung 32 GB Class 10) as well as the majority of new eMMC devices. These will work somewhat better with existing file systems for many workloads, but worse in the worst-case workloads. If f2fs uses the eMMC-4.5 "large unit contexts", they will work as well as or better than the first class, because that should reliably prevent getting into the slow path that we normally see when a log-structured device gets into GC.
I can't think of a workload or hardware in which f2fs would not theoretically outperform ext3 or most other file systems we have.
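For the first class of device, the alignment boils down to simple arithmetic. A minimal sketch follows; the 2 MiB segment size and the figure of at most six open logs come from the f2fs design, the example erase-block and open-block numbers are hypothetical, and mapping the result onto mkfs.f2fs's segments-per-section option is an assumption worth checking against the mkfs.f2fs documentation:

# Rough helper for the "class 1" devices above: line the f2fs allocation unit
# up with the card's erase block and sanity-check the open-block budget.

F2FS_SEGMENT = 2 * 1024 * 1024   # f2fs allocates and writes 2 MiB segments
F2FS_OPEN_LOGS = 6               # hot/warm/cold logs for node and data blocks

def f2fs_layout_hint(erase_block_size, device_open_blocks):
    """Return (segments per section, whether the open logs fit the device)."""
    if erase_block_size % F2FS_SEGMENT:
        raise ValueError("erase block size is not a multiple of 2 MiB")
    return erase_block_size // F2FS_SEGMENT, F2FS_OPEN_LOGS <= device_open_blocks

# Example: a card with 4 MiB erase blocks that can keep 8 blocks open at once.
segs_per_sec, fits = f2fs_layout_hint(4 * 1024 * 1024, 8)
print(segs_per_sec, fits)   # 2 True: sections of two segments, and six logs fit
# A card that can only keep 4 erase blocks open would fail the second check
# and would likely drop into the slow path arnd describes.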
Posted Oct 10, 2012 13:59 UTC (Wed)
by jpfrancois (subscriber, #65948)
[Link] (1 responses)
Posted Oct 10, 2012 19:01 UTC (Wed)
by arnd (subscriber, #8866)
[Link]
Posted Oct 10, 2012 12:02 UTC (Wed)
by cgrey8 (guest, #87131)
[Link] (5 responses)
What we've heard is:
1. Raw NAND changes a lot, requiring near-constant requalification of parts as vendors change their manufacturing processes.
2. Raw NAND requires hardware ECC correction in your microcontroller (not a problem for the ARM Cortex-based TI AM335x we plan to use), but also requires an FS to handle the wear leveling. While UBIFS and others are probably decent at that, doing that in software has been shown to take up considerable processing power on the processor we've selected.
In contrast, eMMC abstracts many details away from us and our processor, so we don't have to deal quite so directly with them nor take the performance hit. But we aren't so naive as to believe that the issues are gone. We are still in the very early stages of R&D on this project, so we have the opportunity to architect our firmware in a way that works best with the eMMC. What we don't have a solid handle on is what exactly that will require of us.
We have identified that we have some data that doesn't change often (U-Boot, the Linux kernel, most of the file system), but also other data, for trend logging, that will be written far more frequently. This would suggest breaking the eMMC into 2 partitions, one dedicated to the stuff that doesn't change often, and the other with LOTS of free space and configured for high durability (pseudo-SLC mode). But this is relatively high level. We'd like to know more about the best practices AND worst practices when working with eMMC in an embedded Linux environment.
TI's Arago Linux is up to 3.2.x and will likely be at 3.4.x by the time we actually deploy the product to market, so I would assume that things like TRIM are a given. Is that a bad assumption? If so, we will need to look into reconfiguring the kernel build to include it, if TRIM is a commonly excluded feature in embedded Linux kernel builds.
But more to the point of this thread, the fact that this new FS from Samsung was designed by makers of eMMC to be "flash friendly" is inviting. But before we jump on a bandwagon, I, like others on this site, would like more details on exactly what this new FS does differently from others (e.g. ext4) to improve the performance and reduce the write amplification of random-write accesses to eMMC, including what "tunable" parameters are available to us. And how would we know when we would need to tinker with them?
Posted Oct 10, 2012 14:49 UTC (Wed)
by dwmw2 (subscriber, #2063)
[Link] (4 responses)
"1. Raw NAND changes a lot, requiring near-constant requalification of parts as vendors change their manufacturing processes."
At least with NAND devices, the issue is mostly just that your board manufacturer finds alternative suppliers for the "same" device so you need to requalify timings, etc. Unless you have a particularly recalcitrant ODM, you should at least get some notice of that kind of thing rather than finding it out post-build-run.
I know there are some people claiming that NAND technology changes too fast, and it's impossible for software to keep up. I'm not sure if that's what you're referring to above, but I'll bring it up so that I can point out that it isn't really true; the people who say this are not speaking from a position of experience. Erase block and page sizes have grown over the years, and the number of writes you can make to a given page has decreased from about 10 writes to a 512-byte page in the 1990s, to a single write (or less than that with paired pages) now. And you need stronger ECC as the underlying technology gets less reliable (especially with MLC). But fundamentally, it's fairly much the same. NAND support was retrofitted to JFFS2 after it had been designed to work with only NOR flash, and the same NAND support is just as valid on today's flash as it was then.
Posted Oct 10, 2012 15:50 UTC (Wed)
by cgrey8 (guest, #87131)
[Link] (3 responses)
We don't have a lot of information on this at all. In fact, all we have is from the TI website. Here's an entry from their Wiki that indicates UBIFS having a significant performance hit on the processor, but doesn't elaborate:
Here's the same wiki lower in the article talking about MMC using ext2 and showing an increase in write bandwidth as well as an order of magnitude less processor load as compared to raw NAND:
The wiki doesn't make specific mention of a difference between eMMC and external SD/MMC although I believe they are actually two different peripherals in the controller (don't quote me on that). Point is, this is where we got our info from.
It is interesting that you mention paired paging. That's new to me. I was aware that partial page writes common in SLC had gone the way of the dodo bird with MLC, but I wasn't aware of them taking it a step further to paired paging. Forgive my ignorance, but what's the difference between paired 4kByte pages and a single 8kByte page size?
It's also beneficial for us to be aware that eMMC parts can change even within the same part number. Do you care to share who the notorious offenders of doing this are? Feel free to email me if you don't care to share publicly (cgrey AT automatedlogic DOT com).
We plan on grossly over-sizing the flash parts relative to the space we plan to consume, so we can make heavy use of wear leveling to get the durability we need. But what you are suggesting is that we could settle on a part that proves to work well for us because it can endure, say, 50k write/erase cycles, but a year later the same part number only gets us 20k write/erase cycles. I would hope that even if the physical NAND used suffers a decrease of this magnitude, the manufacturer would, behind the scenes, add more physical flash to wear-level across, even though they only expose the advertised capacity. Even if they did this, I can see how it could change the realized durability of the part. Thus it would be nice if they notified customers of these types of changes and marked/versioned their parts, so that identified problems could be tied to a specific batch for recall purposes to our customers.
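One way to put rough numbers on that durability concern is a back-of-the-envelope endurance estimate; every figure below is hypothetical and should be replaced with your own:

# Sketch: years until the rated program/erase cycles are used up, assuming
# wear levelling spreads writes evenly across all the physical flash.

def lifetime_years(physical_gb, pe_cycles, writes_gb_per_day, write_amp):
    total_writable_gb = physical_gb * pe_cycles
    return total_writable_gb / (writes_gb_per_day * write_amp * 365)

# 8 GiB of physical flash behind a part that only advertises 2 GiB, writing
# 1 GiB of trend-log data per day with a write amplification factor of 3.
print(lifetime_years(8, 50_000, 1, 3))   # ~365 years at 50k P/E cycles
print(lifetime_years(8, 20_000, 1, 3))   # ~146 years if the cycles drop to 20k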
There's at least 1 eMMC vendor out there that is doing something rather interesting in their eMMC parts. They have a cache area used specifically for random write accesses. The cache consists of high-durability true SLC, and the internal flash manager presumably uses this like a journaling area to avoid write amplification, since partial page writes are allowed all throughout the cache area. If we were to go with that vendor, would having that internal architecture play into which FS would work best? Or do these types of architectures fall into the category of gimmicks that look great in sales pitches, but don't really help much in practice?
Based only on what I know so far, I have to believe that any improvements made to create a "flash friendly" file system relate to compacting small files that originally got written to multiple pages down to consume a single page when the part wear levels. That doesn't help you on the initial write amplification, but it can at least reduce/mitigate the amplification of future writes performed due to wear leveling. Or is this already done in other FSs?
Posted Oct 10, 2012 17:57 UTC (Wed)
by dwmw2 (subscriber, #2063)
[Link] (2 responses)
The benchmarks you link to seem to indicate that DMA isn't being used; it's doing PIO instead. It's not clear why they'd do that. Is this a legacy platform? I'm not familiar with AM335x... what NAND driver does it use? I'm not surprised it uses excessive CPU time if it's doing everything with PIO. If you run your MMC controller in PIO mode it'll be slow too.
I don't have specific examples of changing devices without changing model numbers, but I've heard it repeatedly over the years from people analysing such devices. Personally, I've mostly focused on real flash and left MMC to other people.
Paired pages... when they put 2 bits per flash cell in MLC, you'd have hoped that those two bits would be in the *same* logical page, right? So they get programmed at the same time? But no. They are in *different* logical pages, just to make life fun for you. And they aren't even *adjacent* pages; they can often pair page 0 and 3, page 1 and 2 or something like that. $DEITY knows why.... actually ISTR it's for speed; programming both bits at once would be slower in the microbenchmarks.
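To make the pairing concrete, and to answer the earlier question about how paired pages differ from one larger page: the two halves of a pair are written at different times, so a failed program of the later page can corrupt data the earlier page already held. A small sketch, using a made-up 0-3 / 1-2 pairing (real parts document their own pattern in the datasheet):

def paired_page(page):
    """Return the page that shares MLC cells with `page`, for a hypothetical
    0-3 / 1-2 pairing repeated every four pages."""
    group, offset = divmod(page, 4)
    return group * 4 + {0: 3, 1: 2, 2: 1, 3: 0}[offset]

def pages_at_risk(page_being_programmed):
    """Earlier pages whose already-written data could be corrupted if power
    fails while `page_being_programmed` is being programmed."""
    partner = paired_page(page_being_programmed)
    return [partner] if partner < page_being_programmed else []

print(pages_at_risk(3))   # [0]: programming page 3 can disturb page 0
print(pages_at_risk(0))   # []:  page 0's partner hasn't been written yet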
Posted Oct 10, 2012 18:13 UTC (Wed)
by cgrey8 (guest, #87131)
[Link] (1 responses)
Posted Oct 10, 2012 18:41 UTC (Wed)
by dwmw2 (subscriber, #2063)
[Link]
Posted Oct 11, 2012 1:44 UTC (Thu)
by zhlyp (guest, #87147)
[Link]
Samsung's F2FS filesystem
What do you mean by FTL?
Flash translation layer.

Does the flash translation layer include wear-leveling management for you?
The role of F2FS is not very clear from the announcement. How can you be "NAND friendly" if you work on a system where the "FTL + wear leveling handling" does everything to hide the NAND?
Samsung's F2FS filesystem
"It's impossible to hide them. The fact that flash erase blocks are much larger than write blocks is a central feature of flash and it can never be fully hidden. Any somewhat sane FTL will handle sequential writes just fine and random writes worse, to the degree that its speed will drop by a factor of 10 or 100 beyond a certain threshold. Random reads are fine. These two fundamental, unfixable characteristics are enough to design a filesystem around."
Not really. The FTL is basically a complete file system of its own, and, as with "normal" file systems, designs and on-medium formats can vary wildly. You have no idea how a given FTL will handle sequential vs. non-sequential writes. Some of them are basically log-structured and don't care a jot about the difference. New writes can just be thrown out to the next available area of flash as they arrive, regardless of the logical sector number which is being written. And there's a boatload of RAM for a lookup table to make writes go fast. Power-up is slow, but hey, that's why we have the SATA 'Device Sleep' work to essentially allow the drives to suspend-to-RAM and not have to build the whole table again when they wake up.
It's an attempt to work around the fact that the FTL found in devices like eMMC is generally really low quality.
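For contrast with the simple block-remapping devices discussed earlier in the thread, the log-structured style of FTL described above can be sketched in a few lines (the class and method names are invented for illustration):

# Sketch of a page-level, log-structured mapping: writes go to the next free
# flash location whatever the logical sector number, and a big in-RAM table
# remembers where everything ended up.

class LogStructuredFTL:
    def __init__(self):
        self.mapping = {}        # logical sector -> physical flash location
        self.write_pointer = 0   # next free location in the flash "log"

    def write(self, logical_sector, data):
        # Sequential or random, the cost is the same: one append to the log.
        location = self.write_pointer
        self.write_pointer += 1
        self.mapping[logical_sector] = location   # old copy becomes garbage
        # flash_program(location, data) would go here on real hardware
        return location

ftl = LogStructuredFTL()
for lba in (7, 7, 1000, 3, 2, 1):            # wildly non-sequential writes
    ftl.write(lba, b"x")
print(ftl.mapping)   # every sector simply points at wherever it landed last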
Samsung's F2FS filesystem
eMMC + F2FS might have decent and reproducible performance, but another vendor's SD card with a VFAT-optimised FTL + F2FS might not bring the same benefits?
High-capacity SSD-type devices have more than one NAND chip. Yes, the capacity of a single chip can be lower than that of an array of them. And ECC is mostly done in hardware these days; it's harder to make stupid and conflicting design decisions in different pieces of the software. Although some people can always manage... ☺
Samsung's F2FS filesystem
"eMMC + F2FS might have decent and reproducible performance, but another vendor's SD card with a VFAT-optimised FTL + F2FS might not bring the same benefits?"
I wouldn't go that far. I'd say that F2FS might have decent performance on the specific devices that it's been optimised for — but another device, even from the same vendor and even the same model number, in the future may behave entirely differently. And all your optimisations^Hlayering violations might be counter-productive.
Samsung's F2FS filesystem
"What we've heard is:
1. Raw NAND changes a lot, requiring near-constant requalification of parts as vendors change their manufacturing processes."
There's a certain amount of that, but not nearly as much as with eMMC devices. As I mentioned elsewhere, those have been known to change the whole of their internals — both the FTL implementation and all of the hardware, microcontroller and all — without even changing the model number. It's just a completely different device, with completely different performance and reliability characteristics, from one batch to the next. But because it's still a black box which pretends to be a 16GiB 'disk', they don't see the need to change the model number.
"2. Raw NAND requires hardware ECC correction in your microcontroller (not a problem for the ARM Cortex-based TI AM335x we plan to use), but also requires an FS to handle the wear leveling. While UBIFS and others are probably decent at that, doing that in software has been shown to take up considerable processing power on the processor we've selected."
Most decent SoCs have hardware ECC support on the NAND controller so that shouldn't suck CPU time. And with DMA support, even garbage collection shouldn't take time. If you have specific examples of UBIFS taking "considerable processing power" it'd be interesting to analyse them and see where it's being spent. Note that it'll be garbage collection, not wear levelling per se, which takes the time. Wear levelling is mostly just a matter of choosing which victim blocks to garbage collect from; in itself it isn't time-consuming. Occasionally it means you choose a completely clean block as a victim, and move its entire contents to another physical block. But that should be rare.
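The distinction drawn above between garbage collection and wear levelling can be sketched as a victim-selection policy; the "every hundredth collection" rule below is an invented placeholder, not what UBIFS actually does:

# Garbage collection does the copying; wear levelling is mostly the policy
# for choosing which block to collect next.

def pick_victim(blocks, gc_count, static_wear_interval=100):
    """blocks: list of dicts with 'valid' page counts and 'erases' so far."""
    if gc_count % static_wear_interval == 0:
        # Occasionally pick the least-worn block even if it is full of
        # perfectly good static data, and move that data elsewhere, so the
        # block's remaining erase cycles don't go unused.
        return min(blocks, key=lambda b: b["erases"])
    # The common case: the cheapest block to reclaim is the one with the
    # least valid data left to copy out.
    return min(blocks, key=lambda b: b["valid"])

blocks = [{"valid": 5, "erases": 900}, {"valid": 60, "erases": 12}]
print(pick_victim(blocks, gc_count=1))     # cheap victim: only 5 pages to copy
print(pick_victim(blocks, gc_count=100))   # wear-levelling victim: least worn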
Samsung's F2FS filesystem
http://processors.wiki.ti.com/index.php/AM335x-PSP_04.06....
http://processors.wiki.ti.com/index.php/AM335x-PSP_04.06....