LWN: Comments on "JFFS2, UBIFS, and the growth of flash storage"
https://lwn.net/Articles/528617/
This is a special feed containing comments posted to the individual LWN article titled "JFFS2, UBIFS, and the growth of flash storage".


missed JFFS2's Erase Block Summary (EBS) feature?
https://lwn.net/Articles/539957/
Posted Tue, 26 Feb 2013 06:02:11 +0000 by vapier

JFFS2 has had Erase Block Summary (EBS) support for quite a long time, and it drastically speeds up mount time:
http://www.linux-mtd.infradead.org/doc/jffs2.html

Some actual performance numbers show this can easily be a 6x speed increase:
http://docs.blackfin.uclinux.org/doku.php?id=linux-kernel:jffs#jffs2_mount_process_speedup

Covering yaffs would also have been cool :).

I wonder if the changes you made to block2mtd result in numbers that are really comparable. By faking out the erase steps (which on real flash are not free -- erasing tends to be the slowest operation), UBIFS is no longer resilient to power losses, right? Unlike the others, which would be able to recover. So you've given a nice speed boost to UBIFS without granting the same to the others. Or am I missing something obvious?


JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/532919/
Posted Thu, 17 Jan 2013 11:21:46 +0000 by meuh

> for a PC-style file attribute table

Erk! This is Disk Operating System (DOS) filesystem style, if you like, but it is not tied to any particular hardware.


JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/532373/
Posted Fri, 11 Jan 2013 11:38:47 +0000 by oak

This article doesn't mention which compression algorithm was used with UBIFS.

The compression algorithm has a large effect on speed. For example, LZO is very fast at uncompressing, and its compression is also fast, although that depends on which LZO compression level is used. ZLIB compression, by contrast, is significantly slower for both, but provides better compression results.

Which one to choose depends on:
* What kind of data is being stored
* Flash write and read speeds
* The algorithm's compression and uncompression speed relative to those (which depends on CPU speed)

The article mentions using a high-speed PC, so for it the algorithm with the best compression ratio would probably be best.
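As a rough illustration, the compressor is normally chosen when the UBIFS image is built, via mkfs.ubifs. A minimal sketch (the paths and the -m/-e/-c geometry values below are placeholders that must match the real flash; they are not the article's setup):

    # Build an image with zlib for the best ratio; -m, -e and -c describe
    # the flash geometry (min I/O unit, LEB size, max LEB count) and the
    # values here are only examples.
    mkfs.ubifs -r rootfs/ -o rootfs.ubifs -m 2048 -e 126976 -c 2047 -x zlib

    # Or favour speed with LZO instead:
    mkfs.ubifs -r rootfs/ -o rootfs.ubifs -m 2048 -e 126976 -c 2047 -x lzo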
On Nokia's Maemo devices a low LZO compression level was used at run time, but pre-made root file systems were compressed using the highest LZO compression level (this explains why apt-get upgrade could leave less free disk space even though the binaries were the same size).

As for data types, non-compressible data like music and videos is typically user data, and it can even live on a separate partition or storage medium (on Maemo devices, the SD card) with a different file system than the root file system, which holds the binaries, logs, and so on.


JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/529285/
Posted Sat, 15 Dec 2012 23:26:20 +0000 by marcH

> Getting raw access to flash devices (bypassing vendor FTLs) is indeed the best possible scenario, because this gives way to open development of reliable and high-performance algorithms to manage those.

With current flash technologies and their rate of progress this would practically require writing a different driver for every flash chip.

Maybe what's required is something in the middle: some kind of new, more evolved standard interface, something block+page based?

Anyone who has worked on MTD should know.


JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/529244/
Posted Fri, 14 Dec 2012 18:25:07 +0000 by wookey

Yes, it has that strange business model where most users use it for free under the GPL, but some (largely the ones that actually pay money and make the business viable) use it in other contexts (bootloaders, proprietary OSes, in-house 'stuff'). GPL users sometimes pay for enhancements too, and it was GPL users that paid for most of the initial development.

Yes, this wasn't really discussed as part of the mainlining - it's not technical and thus not really very relevant. The point was made that the FS wasn't only used with Linux, and keeping it working for the others (from a single, or at least very similar, codebase) was important.

It's not quite as black and white as "a choice between mainlining it and not continuing to make a living"; it was just that there was a point beyond which the advantages of mainlining (saving maintenance effort, wider exposure, making life a bit easier for Linux users) were not sufficient to justify the disadvantages (extra maintenance effort due to divergence) from the author's point of view. He decided he'd given it his best shot and been rebuffed. He's not a pushy guy.

It seems no one else has cared enough to try again in the couple of years since then, probably at least partly because they don't feel they have the moral right to do that.
JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/529237/
Posted Fri, 14 Dec 2012 17:13:19 +0000 by plougher

Your comment that "Yaffs continues in its little niche making a living for a couple of people" now helps me better understand why, during the mainlining process, there was such insistence on the author's behalf on keeping the support for other OSes.

Dare I say it, but the clear implication is that Yaffs makes money, but only on the other OSes, and the insistence by the "kernel people" on losing the other-OS support was forcing a choice between mainlining it and not continuing to make a living, or making a living by keeping it out of mainline.

Strangely, I cannot recall that point being made in the mainlining process. Was it made and I missed it, or was it felt inappropriate and unlikely to further the mainlining process? Either way, I now think the author made the "right choice" in not continuing to mainline it.

I like YAFFS, and it was the first workable flash filesystem I used, back in about 2002, at a time when JFFS2 was worse than useless. The fact that YAFFS tends to get written out of Linux kernel history, and the abortive mainlining process, don't show the kernel community in much glory.


JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/529121/
Posted Thu, 13 Dec 2012 18:59:23 +0000 by wookey

Nice article. It would have been interesting to include YAFFS in this comparison too, as it was also written to overcome some of the limitations of JFFS2 (with respect to NAND), starting a few years before UBI. I know it never made it to mainline despite efforts to do that in 2010/2011 (http://linux.derkeiler.com/Mailing-Lists/Kernel/2011-01/msg05318.html), but it has been quite widely used, especially in early Android releases, and still is. The differences between it and JFFS2 and UBIFS are interesting. It is probably true that it offers no real advantages over UBI any more (unless you want to use it with not-Linux), but it was the fastest in the last set of benchmarks I saw a couple of years back.

The tale of its development, the mainlining attempt and why it ultimately failed, and its continued existence in its little niche, making a living for a couple of people, is interesting in itself.

Ultimately the problem was that the kernel people wouldn't take anything less than a rewrite to exclusively use standard kernel features, but the author, who still needed to support it on other OSes, wasn't prepared to remove the compatibility features that made that work. Nearly everything could be munged to satisfy both sides, but a few things were sticking points. It didn't seem possible to reach agreement without forking the codebases, and no one really wanted to do that.


JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/529126/
Posted Thu, 13 Dec 2012 18:39:17 +0000 by yoush

When running UBI over block2mtd over FTLed flash, the FTL in the flash still operates, and it affects both performance and reliability (i.e. wear leveling).

UBI's guaranteed wear leveling is effectively turned into a random thing.

FTLs are known to start misbehaving badly after some time if they don't get information about which blocks are free.
So unless TRIM commands are used, "UBI over block2mtd over FTLed flash" will degrade badly over time.

Getting raw access to flash devices (bypassing vendor FTLs) is indeed the best possible scenario, because this gives way to open development of reliable and high-performance algorithms to manage them.


JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/528947/
Posted Wed, 12 Dec 2012 21:14:09 +0000 by arnd

Thanks for the new data point. Running with active_logs=4 obviously adds some overhead in the file system, because the f2fs garbage collection becomes less efficient and it has to rewrite stuff more. It's not clear whether we get into the case I described, but I think you have shown that the extra overhead in the file system is larger than what we save in the device.

I agree on the read numbers; they are probably just in the noise, because in theory there is no difference at all based on the mount option.

One thing that would make a very significant difference, though, is whether the file system is aged and how full it is, but that is true for all of the tests you did.


JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/528945/
Posted Wed, 12 Dec 2012 21:08:17 +0000 by arnd

I believe all of the interesting block devices for this (SD, CF, eMMC, USB) are not actually asynchronous, unlike modern SSDs, which would not benefit as much because they have less leaky abstractions and don't require you to write on erase-block boundaries for best performance.


TRIM?
https://lwn.net/Articles/528943/
Posted Wed, 12 Dec 2012 21:05:07 +0000 by arnd

I think the cards can either return all-zero or all-one, but they have to report in the configuration registers which of the two they do. Of course, you could in theory reverse all bits in software to get the behavior you want, but that has a nonzero performance impact.

Note that sending the erase command to the SD card can also help performance, as it might avoid expensive garbage collection, aside from being faster than writes.


JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/528885/
Posted Wed, 12 Dec 2012 15:46:45 +0000 by dedekind

Neil, thanks for an interesting article. You are right about buds. We thought that "bud" would be an obvious and self-describing term, but apparently it is not that obvious :-)

And yes, we really did not target block devices, only raw flashes. There was a project to try UBIFS on block devices. Using it "as-is" will of course suck, because UBI cannot really utilize the asynchronous I/O of the block layer. This is fixable, though, but needs some work. I think the benchmark results would be a lot better in that case.


TRIM?
https://lwn.net/Articles/528875/
Posted Wed, 12 Dec 2012 13:32:03 +0000 by yann.morin.1998

> When you read from a region that was TRIMed the result is either undefined or all-zeros, whereas when you read from a region that was mtd_erase()d, the result is all-ones.
> So it wouldn't really be useful to map mtd_erase() to TRIM.
> They do seem similar, but they have quite different semantics.

What about combining your catch-erased-sections-and-return-0xFF approach with TRIMming the underlying storage (if it supports TRIM)?

Regards,
Yann E. MORIN.


TRIM?
https://lwn.net/Articles/528871/
Posted Wed, 12 Dec 2012 13:18:08 +0000 by sperl

As far as I remember the SD card specs, it is not necessarily defined that SD cards always have to return 0xff for erased (= trimmed) blocks...

At least they mention that the behavior may depend on the type of technology (NAND/NOR/...) used at the hardware level.

So the expected return value may also be open to implementation for SSDs... (I have not read the spec there, though.)


JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/528846/
Posted Wed, 12 Dec 2012 07:38:20 +0000 by nhippi

I hope there will be some effort to create an "MMC-direct" extension to the MMC/SD standards, allowing the FTL layer to be bypassed. Or at least exposing the erase block sizes and the other bits of information needed to tune the filesystem to work on it optimally.

This would be especially useful for eMMC storage that is soldered onto the board, and thus doesn't need FAT to be compatible with the world.


JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/528845/
Posted Wed, 12 Dec 2012 05:30:59 +0000 by dgc

Interesting article, Neil. :)

When I see stuff like this, however:

> For testing I used a new class 10 16GB microSD card, which claims 10MB/s throughput and seems to provide close to that for sequential IO. According to the flashbench tool, the card appears to have an 8MB erase block size; five erase blocks can be open at a time,

I always wonder how well using XFS and tuning its geometry to the flash characteristics would work. E.g. use a single stripe unit of the erase block size (8MB in this case) to align fixed metadata and large file allocations to 8MB boundaries. Then setting the number of AGs equal to the number of erase blocks that can be open at a time (5 in this case) gives an appropriate number of separate regions of activity in the filesystem over which to distribute the write load.
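For concreteness, that tuning would look roughly like the following at mkfs time (the device name is hypothetical and the numbers simply restate the card's parameters above; this was not benchmarked in the article):

    # Align allocation to the 8MB erase blocks (one stripe unit per erase
    # block) and create one allocation group per concurrently-open erase
    # block, i.e. five AGs for this card.
    mkfs.xfs -d su=8m,sw=1,agcount=5 /dev/mmcblk0p1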
And then there's dynamic inode allocation, which means inodes are also allocated in the same general locality as the parent directory blocks and their file data.

It seems like these features would provide behaviour similar to filesystems specifically designed for flash, so I've always been curious as to whether they would make any significant difference to performance on a simple flash device like the one you tested with...

-Dave.


TRIM?
https://lwn.net/Articles/528837/
Posted Wed, 12 Dec 2012 01:22:58 +0000 by neilbrown

When you read from a region that was TRIMed the result is either undefined or all-zeros, whereas when you read from a region that was mtd_erase()d, the result is all-ones.

So it wouldn't really be useful to map mtd_erase() to TRIM. They do seem similar, but they have quite different semantics.


JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/528835/
Posted Wed, 12 Dec 2012 01:18:54 +0000 by neilbrown

Thanks for the suggestion. I hadn't used the active_logs mount option. I just ran my script with that option added, and it didn't make much difference.

The numbers I get for the original and the active_logs=4 runs are:

    f2fs-default-2:
      write kernel   113.738   121.853   118.412
      read kernel    150.369   270.465   175.724
      du -s kernel    48.393    48.908    48.6091
      rm -r kernel     0.333     0.384     0.36
      write files    837.503
      read files     571.196

    f2fs-active_logs:
      write kernel   111.966   120.791   116.571
      read kernel    148.364   238.796   163.316
      du -s kernel    48.111    49.623    49.1534
      rm -r kernel     0.335     0.365     0.3489
      write files   1190.29
      read files     563.56

Where there are three numbers they are the min/max/mean of 10 runs.

Reading small files seems faster, but the numbers were already noisy - about half the individual results were within 5 seconds of the minimum, which is much the same in both cases.

The write-large-files test is quite a bit slower. I probably need to do a couple more runs before I know what that means.

So it looks like I wasn't hitting the possible too-many-erase-blocks-open case in this test.


JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/528830/
Posted Wed, 12 Dec 2012 00:17:50 +0000 by masoncl

Really interesting, Neil, thanks. It's worth pointing out that btrfs by default will duplicate metadata, so we're doing 2x the I/O on metadata updates. mkfs.btrfs -m single will turn that off, and it should get our delete times closer to ext4's.


TRIM?
https://lwn.net/Articles/528828/
Posted Tue, 11 Dec 2012 23:34:22 +0000 by cibyr

Could block2mtd translate mtd_erase() into TRIM commands where appropriate?


JFFS2, UBIFS, and the growth of flash storage
https://lwn.net/Articles/528820/
Posted Tue, 11 Dec 2012 23:06:45 +0000 by arnd

Thank you very much for yet another very interesting article on this topic!

One question: since the SD card you measured can support only 5 erase blocks being written concurrently, did you mount f2fs using the "active_logs=4" option? With the default of 6 active logs, plus another erase block being used for global metadata, you might otherwise get into a situation where you alternate between 7 blocks and the card needs to constantly garbage-collect.
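For reference, a minimal sketch of the mount line being suggested (the device and mount point are placeholders, not the article's setup):

    # Restrict f2fs to 4 active logs so that, together with the extra
    # metadata erase block, no more than 5 erase blocks are written to
    # concurrently.
    mount -t f2fs -o active_logs=4 /dev/mmcblk0p2 /mnt/f2fs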