JFFS2, UBIFS, and the growth of flash storage
This quick discarding may well not be appropriate — these are open-source filesystems after all and are thus free to be tinkered with. If the Apollo 13 technicians were able to link the lithium hydroxide canisters from the command module to the CO₂ scrubber in the lunar module, it shouldn't be too hard for us to link a raw-flash filesystem to an FTL-based storage chip — if it seemed like a useful thing to do.
Raw access to a flash device goes through the "mtd" (Memory Technology Devices) interface in Linux and, while this is a rich interface, the vast majority of accesses from a filesystem are via three functions: mtd_read(), mtd_write() and mtd_erase(). The first two are easily implemented by a block device — though you need to allow for the fact that the mtd interface is synchronous while the block layer interface is asynchronous — and the last can be largely ignored as an FTL handles erasure internally. In fact, Linux provides a "block2mtd" device which will present an arbitrary block device as an mtd device. Using this might not be the most efficient way to run a filesystem on new hardware, but it would at least work as a proof-of-concept.
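To make that concrete, here is a rough user-space sketch — not the real block2mtd code, and with the erase-block size chosen arbitrarily — of how those three MTD operations could be backed by an ordinary block device. The point is simply that read and write map directly onto block-device I/O, while erase only needs to recreate the all-0xFF state that raw flash would show.

```c
/* Illustrative sketch only -- not the actual block2mtd driver. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define ERASE_BLOCK_SIZE (128 * 1024)   /* assumed erase-block size */

/* "mtd_read": a block device supports this directly via pread(). */
static ssize_t fake_mtd_read(int fd, uint64_t off, void *buf, size_t len)
{
        return pread(fd, buf, len, (off_t)off);
}

/* "mtd_write": likewise, though a real driver must bridge the
 * synchronous MTD call to the asynchronous block layer. */
static ssize_t fake_mtd_write(int fd, uint64_t off, const void *buf, size_t len)
{
        return pwrite(fd, buf, len, (off_t)off);
}

/* "mtd_erase": the FTL erases internally, so just emulate the all-0xFF
 * state that raw flash would show after an erase. */
static int fake_mtd_erase(int fd, uint64_t off)
{
        uint8_t ones[4096];

        memset(ones, 0xFF, sizeof(ones));
        for (size_t done = 0; done < ERASE_BLOCK_SIZE; done += sizeof(ones))
                if (pwrite(fd, ones, sizeof(ones), (off_t)(off + done)) < 0)
                        return -1;
        return 0;
}
```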
So it seems that there could be some possibility of using one of these filesystems, possibly with a little modification, on an FTL-based flash device, and there could certainly be value in understanding them a little better as, at the very least, they could have lessons to teach us.
A common baseline
Despite their separate code bases, there is a lot of similarity between JFFS2 and UBIFS — enough that it seems likely that the latter was developed in part to overcome the shortcomings of the former. One similarity is that, unlike the other filesystems we have looked at, neither of these filesystems has a strong concept of a "basic block size". The concept is there if you look for it, but it isn't prominent.
One of the main uses of a block size in a filesystem is to manage free space. Some blocks are in use, others are free. If a block is only partially used — for example if it contains the last little bit of a file — then the whole block is considered to be in use. For flash filesystems, blocks are not as useful for free-space management as this space is managed in terms of "erase blocks," which are much larger than the basic blocks of other filesystems, possibly as large as a few megabytes. Another use of blocks in a filesystem is as a unit of metadata management. For example, NILFS2 manages the ifile (inode file) as a sequence of blocks (rather than just a sequence of inodes), while f2fs manages each directory as a set of hash tables, each of which contains a fixed number of blocks.
JFFS2 and UBIFS don't take this approach at all. All data is written consecutively to one or more erase blocks with some padding to align things to four-byte boundaries, but with no alignment so large that it could be called a block. When indexing of data is needed, an erase-block number combined with a byte offset meets the need, so the lack of alignment does not cause an issue there.
Both filesystems further make use of this freedom in space allocation by compressing the data before it is written. Various compression schemes are available including LZO and ZLIB together with some simpler schemes like run-length encoding. Which scheme is chosen depends on the desired trade off between space saving and execution time. This compression can make a small flash device hold nearly twice as much as you might expect, depending on the compressibility of the files of course. Your author still recalls the pleasant surprise he got when he found out how much data would fit on the JFFS2 formatted 256MB flash in the original Openmoko Freerunner: a reasonably complete Debian root filesystem with assorted development tools and basic applications still left room for a modest amount of music and some OSM map tiles.
In each case, the data and metadata of the filesystem are collected into "nodes" which are concatenated and written out to a fresh erase block. Each node records the type of data (inode, file data, directory name, etc.), the address of the data (such as inode number), the type of compression and a few other details. This makes it possible to identify the contents of the flash when mounting and when cleaning, and effectively replaces the "segment summary" that is found in f2fs and NILFS2.
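As a rough illustration — a simplified model, not the exact on-disk layout of either JFFS2 or UBIFS — a node header along these lines carries enough information to make the scan and the cleaner possible:

```c
#include <stdint.h>

/* Simplified model of a log-node header.  Field names and widths are
 * illustrative, not the real JFFS2 or UBIFS on-disk format. */
enum node_type  { NODE_INODE, NODE_DATA, NODE_DIRENT };
enum compr_type { COMPR_NONE, COMPR_RLE, COMPR_LZO, COMPR_ZLIB };

struct node_header {
        uint32_t magic;        /* identifies a node boundary when scanning */
        uint16_t type;         /* enum node_type */
        uint16_t compr;        /* enum compr_type used for the payload */
        uint32_t inode;        /* which inode this node belongs to */
        uint32_t version;      /* newer versions supersede older nodes */
        uint32_t data_offset;  /* file offset of the data (data nodes only) */
        uint32_t compr_len;    /* payload length as stored on flash */
        uint32_t orig_len;     /* payload length after decompression */
        uint32_t crc;          /* detects partially written nodes */
};
/* The (possibly compressed) payload follows the header, padded to a
 * four-byte boundary before the next node begins. */
```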
Special note should be made of the directory name nodes. While the other filesystems we have studied store a directory much like a file, with filenames stored at various locations in that file, these two filesystems do not. Each entry in the directory is stored in its own node, and these nodes do not correspond to any particular location in a "file" — they are simply unique entries. JFFS2 and UBIFS each have their own particular way of finding these names as we shall see, but in neither case is the concept of a file offset part of that.
The one place where a block size is still visible in these filesystems is in the way they chop a file up into nodes for storage. In JFFS2, a node can be of any size up to 4KB so a log file could, for example, be split up as one node per line. However the current implementation always writes whole pages — to quote the in-line commentary, "It sucks, but it's simple". For UBIFS, data nodes must start at a 4KB-aligned offset in the file so they are typically 4KB in size (before compression) except when at the end of the file.
JFFS2 — the journaling flash filesystem
A traditional journaling filesystem, such as ext3 or xfs, adds a journal to a regular filesystem. Updates are written first to the journal and then to the main filesystem. When mounting the filesystem after a shutdown, the journal is scanned and anything that is found is merged into the main filesystem, thus providing crash tolerance. JFFS2 takes a similar approach with one important difference — there is no "regular filesystem". With JFFS2 there is only a journal, a journal that potentially covers the entire device.
It is probably a little misleading to describe JFFS2 as "just one journal". This is because it might lead you to think that when it gets to the end of the journal it just starts again at the beginning. While this was true of JFFS1, it is not for JFFS2. Rather it might be clearer to think of each erase block as a little journal. When one erase block is full, JFFS2 looks around for another one to use. Meanwhile if it notices that some erase blocks are nearly empty it will move all the active nodes out of them into a clean erase block, and then erase and re-use those newly-cleaned erase blocks.
When a JFFS2 filesystem is mounted, all of these journals, and thus the entire device, are scanned and every node found is incorporated into an in-memory data structure describing the filesystem. Some nodes might invalidate other nodes; this may happen when a file is created and then removed: there will be a node recording the new filename as belonging to some directory, and then another node recording that the filename has been deleted. JFFS2 resolves all these modifications and ends up with a data structure that describes the filesystem as it was the last time something was written to it, and also describes where the free space is. The structure is kept as compact as possible and naturally does not contain any file data; instead, it holds only the addresses where the data should be found and so, while it will be much smaller than the whole filesystem, it will still grow linearly as the filesystem grows.
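A mount-time scan of this kind boils down to a "last writer wins" merge. The toy model below — illustrative only; the real code builds a much richer structure — keeps, for each object, only the node with the highest version number, so a later "name deleted" node cancels an earlier "name created" node:

```c
/* Toy model of a JFFS2-style mount scan: walk every node on the device
 * and remember only the newest version of each object. */
#include <stdint.h>
#include <stdlib.h>

struct scanned_node {
        uint32_t key;          /* e.g. inode number or directory-entry hash */
        uint32_t version;      /* larger version wins */
        uint32_t flash_addr;   /* where the node lives: erase block + offset */
        int      deleted;      /* a deletion node obsoletes earlier nodes */
};

struct fs_skeleton {
        struct scanned_node *nodes;
        size_t count;
};

static struct scanned_node *find(struct fs_skeleton *fs, uint32_t key)
{
        for (size_t i = 0; i < fs->count; i++)
                if (fs->nodes[i].key == key)
                        return &fs->nodes[i];
        return NULL;
}

/* Called for every node found while scanning the flash. */
static void merge_node(struct fs_skeleton *fs, const struct scanned_node *n)
{
        struct scanned_node *cur = find(fs, n->key);

        if (cur == NULL) {
                struct scanned_node *grown =
                        realloc(fs->nodes, (fs->count + 1) * sizeof(*n));
                if (grown == NULL)
                        return;                 /* out of memory: skip */
                fs->nodes = grown;
                fs->nodes[fs->count++] = *n;
        } else if (n->version > cur->version) {
                *cur = *n;      /* newer node supersedes the older one */
        }
        /* Entries left with .deleted set describe obsolete space that the
         * cleaner can later reclaim. */
}
```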
This need to scan the entire device at mount time and store the skeleton of the filesystem in memory puts a limit on the size of filesystem that JFFS2 is usable for. Some tens of megabytes, or even a few hundred megabytes, is quite practical. Once the device gets close to, or exceeds, a gigabyte, JFFS2 becomes quite impractical. Even if memory for storing the tree were cheap, time to mount the filesystem is not.
This is where UBIFS comes in. While the details are quite different, UBIFS is a lot like JFFS2 with two additions: a tree to index all the nodes in the filesystem, and another tree to keep track of free space. With these two trees, UBIFS avoids both the need to scan the entire device at mount time and the need to keep a skeleton of the filesystem in memory at all times. This allows UBIFS to scale to much larger filesystems — certainly many tens of gigabytes and probably more.
But before we look too closely at these trees it will serve us well to look at some of the other details and in particular at "UBI", a layer between the MTD flash interface layer and UBIFS. UBI uses an unsorted collection of flash erase blocks to present a number of file system images; UBI stands for Unsorted Block Images.
UBI — almost a Flash Translation Layer
The documentation for UBI explicitly states that it is not a flash translation layer. Nonetheless it shares a lot of functionality with an FTL, particularly wear leveling and error management. If you imagined UBI as an FTL where the block size was the same as the size of an erase block, you wouldn't go far wrong.
UBI uses a flash device which contains a large number of Physical Erase Blocks (PEBs) to provide one or more virtual devices (or "volumes") which each consist of a smaller number of Logical Erase Blocks (LEBs), each slightly smaller than a PEB. It maintains a mapping from LEB to PEB and this mapping may change from time to time due to various causes including:
- Writing to an LEB. When an LEB is written, the data will be written to a new, empty PEB and the mapping from LEB to PEB will be updated. UBI is then free to erase the old PEB at its leisure. Normally, the first new write to an LEB will make all the data previously there inaccessible. However, a feature is available where the new PEB isn't committed until the write request completes. This ensures that after a sudden power outage, the LEB will either have the old data or the complete new data, never anything else.
- Wear leveling. UBI keeps a header at the start of each PEB which is rewritten immediately after the block is erased. One detail in the header is how many times the PEB has been written and erased. When UBI notices that the difference between the highest write count and the lowest write count in all the PEBs gets too high (based on a compile-time configuration parameter: MTD_UBI_WL_THRESHOLD), it will move an LEB stored in a PEB with a low write count (which is assumed to be stable since the PEB containing it has not been rewritten often) to one with a high write count. If this data continues to be as stable as it has been, this will tend to reduce the variation among write counts and achieve wear leveling.
- Scrubbing. NAND flash includes an error-correcting code (ECC) for each page (or sub-page) which can detect multiple-bit errors and correct single-bit errors. When an error is reported while reading from a PEB, UBI will relocate the LEB in that PEB to another PEB so as to guard against a second bit error, which would be uncorrectable. This process happens transparently and is referred to as "scrubbing".
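The sketch below models, in a simplified way, two pieces of the state just described: the LEB-to-PEB map that is switched on every write, and the write-count comparison used to trigger wear leveling. The structure names and the threshold value are illustrative stand-ins, not UBI's actual internals.

```c
#include <stdint.h>

#define WL_THRESHOLD 4096   /* illustrative stand-in for MTD_UBI_WL_THRESHOLD */
#define NO_PEB       ((uint32_t)-1)

struct ubi_model {
        uint32_t *leb_to_peb;    /* current PEB for each LEB (or NO_PEB) */
        uint32_t *write_count;   /* per-PEB write/erase counter */
        uint32_t  peb_count;
        uint32_t  leb_count;
};

/* Writing an LEB: put the data in a fresh PEB, then switch the mapping.
 * The old PEB can be erased (and its counter bumped) at leisure; this
 * model simply counts it immediately. */
static void remap_leb(struct ubi_model *ubi, uint32_t leb, uint32_t new_peb)
{
        uint32_t old_peb = ubi->leb_to_peb[leb];

        ubi->leb_to_peb[leb] = new_peb;
        if (old_peb != NO_PEB)
                ubi->write_count[old_peb]++;
}

/* Wear leveling: if the spread of write counts grows too large, move the
 * contents of a rarely rewritten PEB (presumably static data) into a
 * heavily rewritten one, so future writes land on the fresher block. */
static int needs_wear_leveling(const struct ubi_model *ubi)
{
        uint32_t lo = UINT32_MAX, hi = 0;

        for (uint32_t i = 0; i < ubi->peb_count; i++) {
                if (ubi->write_count[i] < lo)
                        lo = ubi->write_count[i];
                if (ubi->write_count[i] > hi)
                        hi = ubi->write_count[i];
        }
        return hi - lo > WL_THRESHOLD;
}
```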
The functionality described above is already an advance on the flash support that JFFS2 provides. JFFS2 does some wear leveling but it is not precise. It keeps no record of write counts but, instead, decides to relocate an erase block based on the roll of a die (or, actually, the sampling of a random number). This probably provides some leveling of wear, but there are no guarantees. JFFS2 also has no provision for scrubbing.
The mapping from PEB to LEB is stored spread out over all active erase blocks in the flash device. After the PEB header that records the write count there is a second header which records the volume identifier and LEB number of the data stored here. To recover this mapping at mount time, UBI needs to read the first page or two from every PEB. While this isn't as slow as reading every byte like JFFS2 has to, it would still cause mount time to scale linearly with device size — or nearly linearly as larger devices are likely to have larger erase block sizes.
Recently this situation has improved. A new feature known as "fastmap" made its way into the UBI driver for Linux 3.7. Fastmap stores a recent copy of the mapping in some erase block together with a list of the several (up to 256) erase blocks which will be written next, known as the pool. The mount process then needs to examine the first 64 PEBs to find a "super block" which points to the mapping, read the mapping, and then read the first page of each PEB in the pool to find changes to the mapping. When the pool is close to exhaustion, a new copy of the mapping with a new list of pool PEBs is written out. This is clearly a little more complex, but puts a firm cap on the mount time and so ensures scalability to much larger devices.
UBIFS — the trees
With UBIFS, all the filesystem content — inodes, data, and directory entries — is stored in nodes in various arbitrary Logical Erase Blocks, and the addresses of these blocks are stored in a single B-tree. This is similar in some ways to reiserfs (originally known as "treefs") and Btrfs, and contrasts with filesystems like f2fs, NILFS2 and ext3 where inodes, file data, and directory entries are all stored with quite different indexing structures.
The key for lookup in this B-tree is 64 bits wide, formed from a 32-bit inode number, a three-bit node type, and a 29-bit offset (for file data) or hash value (for directory entries). This last field, combined with a 4KB block size used for indexing, limits the size of the largest file to two terabytes, probably the smallest limit in the filesystem.
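A minimal sketch of how such a 64-bit key could be packed and unpacked follows; the widths are taken from the description above, but the exact field ordering inside UBIFS may differ, so treat this as illustrative only:

```c
#include <stdint.h>

/* 64-bit lookup key: 32-bit inode number, 3-bit node type, 29-bit
 * block-index-or-hash.  Field order here is an assumption. */
#define TYPE_BITS   3
#define INDEX_BITS  29

static inline uint64_t make_key(uint32_t ino, unsigned type, uint32_t index)
{
        return ((uint64_t)ino << (TYPE_BITS + INDEX_BITS)) |
               ((uint64_t)(type & 0x7) << INDEX_BITS) |
               (index & ((1u << INDEX_BITS) - 1));
}

static inline uint32_t key_inode(uint64_t key) { return key >> (TYPE_BITS + INDEX_BITS); }
static inline unsigned key_type(uint64_t key)  { return (key >> INDEX_BITS) & 0x7; }
static inline uint32_t key_index(uint64_t key) { return key & ((1u << INDEX_BITS) - 1); }

/* For data nodes the 29-bit field is a 4KB block index, which is where the
 * two-terabyte file-size limit comes from: 2^29 * 4096 bytes = 2^41 = 2TB. */
```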
Nodes in this B-tree are, like other nodes, stored in whichever erase block happens to be convenient. They are also like other nodes in that they are not sized to align with any "basic block" size. Rather the size is chosen based on the fan-out ratio configured for the filesystem. The default fan-out is eight, meaning that each B-tree node contains eight keys and eight pointers to other nodes, resulting in a little under 200 bytes per node.
Using small nodes means that fewer bytes need to be written when updating indexes. On the other hand, there are more levels in the tree so more reading is likely to be required to find a node. The ideal trade off will depend on the relative speeds of reads and writes. For flash storage that serves reads a lot faster than writes — which is not uncommon, but seemingly not universal — it is likely that this fan-out provides a good balance. If not, it is easy to choose a different fan-out when creating a filesystem.
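To see the effect of fan-out on tree depth, consider this rough calculation (illustrative only; real B-tree nodes are rarely completely full):

```c
#include <stdio.h>

/* Levels needed for a B-tree with the given fan-out to index n leaves:
 * each additional level multiplies the reachable leaf count by f. */
static unsigned tree_levels(unsigned long long n, unsigned fanout)
{
        unsigned levels = 0;
        unsigned long long reach = 1;

        while (reach < n) {
                reach *= fanout;
                levels++;
        }
        return levels;
}

int main(void)
{
        /* With a million indexed nodes: fan-out 8 needs about 7 levels,
         * fan-out 64 only about 4 -- fewer index nodes to read on lookup,
         * but more bytes to rewrite whenever the index changes. */
        printf("fanout 8:  %u levels\n", tree_levels(1000000ULL, 8));
        printf("fanout 64: %u levels\n", tree_levels(1000000ULL, 64));
        return 0;
}
```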
New nodes in the filesystem do not get included in the indexing B-tree immediately. Rather, their addresses are written to a journal, to which a few LEBs are dedicated. When the filesystem is mounted, this journal is scanned, the nodes are found, and based on the type and other information in the node header, they are merged into the indexing tree. This merging also happens periodically while the filesystem is active, so that the journal can be truncated. Those nodes that are not yet indexed are sometimes referred to as "buds" — a term which at first can be somewhat confusing. Fortunately the UBIFS code is sprinkled with some very good documentation so it wasn't too hard to discover that "buds" were nodes that would soon be "leaves" of the B-tree, but weren't yet — quite an apt botanical joke.
Much like f2fs, UBIFS keeps several erase blocks open for writes at the same time so that different sorts of data can be kept separate from each other, which, among other things, can improve cleaning performance. These open blocks are referred to as different "journal heads". UBIFS has one "garbage collection" head where the cleaner writes nodes that it moves — somewhat like the "COLD" sections in f2fs. There is also a "base" head where inodes, directory entries, and other non-data nodes are written — a bit like the "NODE" sections in f2fs. Finally, there are one or more "data" heads where file data is written, though the current code doesn't appear to actually allow the "or more" aspect of the design.
The other tree that UBIFS maintains is used for keeping track of free space or, more precisely, how many active nodes there are in each erase block. This tree is a radix tree with a fan-out of four. So if you write the address of a particular LEB in base four (also known as radix-four), then each digit would correspond to one level in the tree, and its value indicates which child to follow to get down to the next level.
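For example — a sketch of the indexing arithmetic only, not UBIFS code — the child to follow at each level is just the next base-4 digit (two bits) of the LEB number:

```c
#include <stdio.h>

/* Walk a radix-4 tree: each level consumes one base-4 digit (two bits)
 * of the LEB number, most significant digit first. */
static void print_path(unsigned leb, unsigned levels)
{
        for (int level = (int)levels - 1; level >= 0; level--) {
                unsigned child = (leb >> (2 * level)) & 0x3;
                printf("level %d: follow child %u\n", (int)levels - 1 - level, child);
        }
}

int main(void)
{
        /* LEB 37 is 211 in base 4, so with a three-level tree the path is
         * child 2, then child 1, then child 1. */
        print_path(37, 3);
        return 0;
}
```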
This tree is stored in a completely separate part of the device with its own set of logical erase blocks, its own garbage collection, and consequently its own table of LEB usage counters. This last table must be small enough to fit in a single erase block and so imposes a (comfortably large) limit on the filesystem size. Keeping this tree separate seems like an odd decision, but doubtlessly simplifies the task of keeping track of device usage. If the node that records the usage of an LEB were to be stored in that LEB, there would be additional complexity which this approach avoids.
A transition to FTL?
While JFFS2 clearly has limits, UBIFS seems to be much less limited. With 32 bits to address erase blocks which, themselves, could comfortably cover several megabytes, the addressing can scale to petabyte devices. The B-tree indexing scheme should allow large directories and large files to work just as well as small ones. The two terabyte limit on individual files might one day be a limit but that still seems a long way off. With the recent addition of fastmap for UBI, UBIFS would seem ready to scale to the biggest flash storage we have available. But it still requires raw flash access while a lot of flash devices force all access to pass through a flash translation layer. Could UBIFS still be useful on those devices?
Given that the UBI layer looks a lot like an FTL it seems reasonable to wonder if UBI could be modified slightly to talk to a regular block device instead, and allow it to talk to an SD card or similar. Could this provide useful performance?
Unfortunately such a conversion would be a little bit more than an afternoon's project. It would require:
- Changing the expectation that all I/O is synchronous. This might
be as simple as waiting immediately after submitting each request,
but it would be better if true multi-threading could be achieved.
Currently, UBIFS disables readahead because it is incompatible with
a synchronous I/O interface.
- Changing the expectation that byte-aligned reads are possible.
UBIFS currently reads from a byte-aligned offset into a buffer,
then decompresses from there. To work with the block layer it
would be better to use a larger buffer that was sector-aligned, and
then understand that the node read in would be found at an offset into that
buffer, not at the beginning.
- Changing the expectation that erased blocks read as all ones.
When mounting a filesystem, UBIFS scans various erase blocks and
assumes anything that isn't 0xFF is valid data. An
FTL-based flash store will not provide that guarantee, so UBIFS would need to
use a different mechanism to reliably detect dead data. This is
not conceptually difficult but could be quite intrusive to the
code.
- Finding some way to achieve the same effect as the atomic LEB updates that UBI can provide. Again, a well understood problem, but possibly intrusive to fix.
So without a weekend to spare, that approach cannot be experimented with. Fortunately there is an alternative. As mentioned, there already exists a "block2mtd" driver which can be used to connect UBIFS, via UBI and mtd, to a block device. This driver is deliberately very simple and consequently quite inefficient. For example, it handles the mtd_erase() function by writing blocks full of 0xFF to the device. However, it turns out that it is only an afternoon's project to modify it to allow for credible testing.
This patch modifies the block2mtd driver to handle mtd_erase() by recording the location of erased blocks in memory, returning 0xFF for any read of an erased block, and not writing out the PEB headers until real data is to be written to the PEB. The result of these changes is that the pattern of reads and, more importantly, writes to the block device will be much the same as the pattern of reads and writes expected from a more properly modified UBIFS. It is clearly not useful for real usage as important information is kept in memory, but it can provide a credible base for performance testing.
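The changes amount to something like the following model — a user-space sketch of the idea, not the actual patch, with the erase-block size and helper names as assumptions: keep an in-memory bitmap of erased blocks, satisfy reads of erased blocks with 0xFF, and only touch the block device when real data is written.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* In-memory model of the modified block2mtd behaviour; illustrative only. */
#define ERASE_BLOCK_SIZE (128 * 1024)   /* assumed */
#define MAX_ERASE_BLOCKS 65536

static int backing_fd;                          /* the underlying block device */
static uint8_t erased[MAX_ERASE_BLOCKS / 8];    /* one bit per erase block */

static void mark_erased(unsigned eb)  { erased[eb / 8] |=  (uint8_t)(1u << (eb % 8)); }
static void mark_written(unsigned eb) { erased[eb / 8] &= (uint8_t)~(1u << (eb % 8)); }
static int  is_erased(unsigned eb)    { return erased[eb / 8] & (1u << (eb % 8)); }

/* mtd_erase(): nothing is written to the device; just remember the state. */
static void model_erase(unsigned eb)
{
        mark_erased(eb);
}

/* mtd_read(): erased blocks are answered from memory as all 0xFF, so the
 * block device sees neither the erase nor any read of erased space. */
static ssize_t model_read(unsigned eb, size_t offset, void *buf, size_t len)
{
        if (is_erased(eb)) {
                memset(buf, 0xFF, len);
                return (ssize_t)len;
        }
        return pread(backing_fd, buf, len,
                     (off_t)eb * ERASE_BLOCK_SIZE + (off_t)offset);
}

/* mtd_write(): the first real write to an erased block is when data (and
 * any deferred PEB header) finally reaches the device. */
static ssize_t model_write(unsigned eb, size_t offset, const void *buf, size_t len)
{
        mark_written(eb);
        return pwrite(backing_fd, buf, len,
                      (off_t)eb * ERASE_BLOCK_SIZE + (off_t)offset);
}
```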
The obvious choice of what to test it against is f2fs. Having examined the internals of both f2fs and UBIFS, we have found substantial similarity which is hardly surprising as they have both been designed to work with flash storage. Both write whole erase blocks at a time where possible, both have several erase blocks "open" at once, and both make some efforts to collect similar data into the same erase blocks. There are of course differences though: UBIFS probably scales better to large directories, it can compress data being written, and it does not currently support exporting via NFS, partly because of the difficulty of providing a stable index for directory entries.
The compression support is probably most interesting. If the CPU is fast enough, compression might be faster than writing to flash and this could give UBIFS an edge in speed.
I performed some testing with f2fs and UBIFS; the latter was tested twice, with and without the use of compression (the non-compression case is marked below as "NC"). Just for interest's sake I've added NILFS2, ext4 and Btrfs. None of these are particularly designed for FTL based flash, though NILFS2 can align writes with the erase blocks and so might perform well. The results of the last two should be treated very cautiously. No effort was made to tune them to the device used, and all the results are based on writing to an empty device. For f2fs, UBIFS, and NILFS2 we know that they can "clean" the device so they always write to unused erase blocks. ext4 and Btrfs do not do the same cleaning so it is quite possible that the performance will degrade on a more "aged" filesystem. So the real long term values for these filesystems might be better, and might be worse, than what we see here.
For testing I used a new class 10 16GB microSD card, which claims 10MB/s throughput and seems to provide close to that for sequential IO. According to the flashbench tool, the card appears to have an 8MB erase block size; five erase blocks can be open at a time, and only the first erase block is optimized for a PC-style file attribute table. The kernel used was 3.6.6 for openSUSE with the above-mentioned patch and the v3 release of f2fs.
The tests performed were very simple. To measure small file performance, a tar archive of the Linux kernel (v3.7-rc6) was unpacked ten times and then — after unmounting and remounting — the files were read back in again and "du" and "rm -r" were timed to check metadata performance. The "rm -r" test was performed with a warm cache, immediately after the "du -a", which was performed on a cold cache. The average times in seconds for these operations were:
                ubifs   ubifs (NC)   f2fs    NILFS2   ext4    Btrfs
  Write kernel   72.4        139.9   118.4    140.0   135.5    93.6
  Read kernel    72.5        129.6   175.7     95.6   108.8   121.0
  du -s           9.9          8.7    48.6      4.4     4.4    13.8
  rm -r           0.48         0.45    0.36    11.0     4.9    33.6
Some observations:
- UBIFS, with compression, is clearly the winner at reading and writing small files. This test was run on an Intel Core i7 processor running at 1GHz; on a slower processor, the effect might not be as big. Without compression, UBIFS is nearly the slowest, which is a little surprising, but that could be due to the multiple levels that data passes through (UBI, MTD, block2mtd).
- f2fs is surprisingly poor at simple metadata access ("du -s"). It is unlikely that this is due to the format chosen for the filesystem — the indirection of the Node Address Table is the only aspect of the design that could possibly cause this slowdown and it could explain at most a factor of two. This poor performance is probably some simple implementation issue. The number is stable across the ten runs, so it isn't just a fluke.
- Btrfs is surprisingly fast at writing. The kernel source tree is about 500MB in size, so this is around 5.5MB/sec, which is well below what the device can handle but is still faster than anything else. This presumably reflects the performance-tuning efforts that the Btrfs team have made.
- "
rm -r
" is surprisingly slow for the non-flash-focused filesystems, particularly Btrfs. The variance is high too. For ext4, the slowest "rm -r
" took 32.4 seconds, while, for Btrfs, the slowest was 137.8 seconds — over 2 minutes. This seems to be one area where tuning the design for flash can be a big win.
So there is little here to really encourage spending that weekend to make UBIFS work well directly on flash. Except for the compression advantage, we are unlikely to do much better than f2fs, which can be used without that weekend of work. We would at least need to see how compression performs on the processor found in the target device before focusing too much on it.
As well as small files, I did some even simpler large-file tests. For this, I wrote and subsequently read two large, already compressed, files. One was an mp4 file with about one hour of video. The other was an openSUSE 12.2 install ISO image. Together they total about 6GB. The total times for each filesystem were:
                ubifs   ubifs (NC)   f2fs    NILFS2   ext4    Btrfs
  write files     850          876    838      1522    696     863
  read files     1684         1539    571       574    571     613
The conclusions here are a bit different:
- Now ext4 is a clear winner on writes. It would be very
interesting to work out why. The time translates to about 8.8MB/sec which
is getting close to the theoretical maximum of 10MB/sec.
- Conversely, NILFS2 is a clear loser, taking nearly twice as long as the
other filesystems. Two separate runs showed similar results so it looks
like there is room for some performance tuning here.
- UBIFS is a clear loser on reads. This is probably because nodes
are not aligned to sectors so some extra reading and extra copying
is needed.
- The ability for UBIFS to compress data clearly doesn't help with these large files. UBIFS did a little better with compression enabled, suggesting that the files were partly compressible, but it wasn't enough to come close to f2fs.
In summary, while f2fs appears to have room for improvement in some aspects of the implementation, there seems little benefit to be gained from pushing UBIFS into the arena of FTL-based devices. It will likely remain the best filesystem for raw flash, while f2fs certainly has some chance of positioning itself as the best filesystem for FTL-based flash. However, we certainly shouldn't write off ext4 or Btrfs. As noted earlier, these tests are not expected to give a firm picture of these two filesystems so we cannot read anything conclusive from them. However, it appears that both have something to offer, if only we can find a way to isolate that something.
Index entries for this article:
Kernel: Filesystems/Flash
GuestArticles: Brown, Neil
Posted Dec 11, 2012 23:06 UTC (Tue) by arnd (subscriber, #8866)
One question: Since the SD card you measured can support only 5 erase blocks being written concurrently, did you mount f2fs using the "active_logs=4" option? With the default of 6 active logs plus another erase block being used for global metadata, you might otherwise get into a situation where you alternate between 7 blocks and the card needs to constantly garbage-collect.
Posted Dec 12, 2012 1:18 UTC (Wed) by neilbrown (subscriber, #359)
Thanks for the suggestion. I hadn't used the active_logs mount option. I just ran my script with that option added and it didn't make much difference.
The numbers I get for the original and the active_logs=4 runs are:

f2fs-default-2:
  write kernel   113.738   121.853   118.412
  read kernel    150.369   270.465   175.724
  du -s kernel    48.393    48.908    48.6091
  rm -r kernel     0.333     0.384     0.36
  write files    837.503
  read files     571.196

f2fs-active_logs:
  write kernel   111.966   120.791   116.571
  read kernel    148.364   238.796   163.316
  du -s kernel    48.111    49.623    49.1534
  rm -r kernel     0.335     0.365     0.3489
  write files   1190.29
  read files     563.56

Where there are 3 numbers they are min/max/mean of 10 runs.

Reading small files seems faster, but the numbers were already noisy - about half the individual results were within 5 seconds of the minimum, which is much the same in both cases.

The write-large-files test is quite a bit slower. I probably need to do a couple more runs before I know what that means.
So it looks like I wasn't hitting the possible too-many-erase-blocks-open case in this test.
Posted Dec 12, 2012 21:14 UTC (Wed) by arnd (subscriber, #8866)
I agree on the read numbers, they are probably just in the noise because in theory there is no difference at all based on the mount option.
One thing that would make a very significant difference though is whether the file system is aged and how full it is, but that is true for all of the tests you did.
Posted Dec 11, 2012 23:34 UTC (Tue) by cibyr (subscriber, #87609)
Posted Dec 12, 2012 1:22 UTC (Wed) by neilbrown (subscriber, #359)
So it wouldn't really be useful to map mtd_erase() to TRIM. They do seem similar but they have quite different semantics.
Posted Dec 12, 2012 13:18 UTC (Wed) by sperl (subscriber, #5657)
At least it mentions that the behavior may depend on the type of technology (NAND/NOR/...) that is used at the HW level.
So the behavior of expected return may also be open to implementation for SSDs... (I have not read the spec there though)
Posted Dec 12, 2012 21:05 UTC (Wed) by arnd (subscriber, #8866)
In theory you could reverse all bits in software to get the behavior you want, but that has a nonzero performance impact.

Note that sending the erase command to the SD card can also help performance as it might avoid expensive garbage collection, aside from being faster than writes.
Posted Dec 12, 2012 13:32 UTC (Wed) by yann.morin.1998 (guest, #54333)

> So it wouldn't really be useful to map mtd_erase() to TRIM. They do seem similar but they have quite different semantics.

What about combining your catch-erased-sections-and-return-0xFF with TRIMing the underlying storage (if it supports TRIMing)?

Regards,
Yann E. MORIN.
Posted Dec 12, 2012 0:17 UTC (Wed) by masoncl (subscriber, #47138)
Posted Dec 12, 2012 5:30 UTC (Wed) by dgc (subscriber, #6611)
When I see stuff like this, however:
> For testing I used a new class 10 16GB microSD card, which claims 10MB/s
> throughput and seems to provide close to that for sequential IO.
> According to the flashbench tool, the card appears to have an 8MB erase
> block size; five erase blocks can be open at a time,
I always wonder how well using XFS and tuning its geometry to the flash characteristics would work. E.g. use a single stripe unit of the erase block size (8MB in this case) to align fixed metadata and large file allocation to 8MB boundaries. Then setting the number of AGs equal to the number of open erase blocks at a time (5 in this case) gives an appropriate number of separate regions of activity in the filesystem to distribute the write loads.
And then there's the dynamic inode allocation, which means inodes are also allocated in the same general locality as the parent directory blocks and their file data.
It seems like these features would provide behaviour similar to that of filesystems specifically designed for flash, so I've always been curious as to whether it would make any significant difference to performance on a simple flash device like the one you tested with above...
-Dave.
Posted Dec 12, 2012 7:38 UTC (Wed) by nhippi (subscriber, #34640)
This would be especially useful for the eMMC storages that are soldered on board, and thus don't need FAT to be compatible to the world.
Posted Dec 12, 2012 15:46 UTC (Wed) by dedekind (guest, #32521)
And yes, we really did not target block devices, but only raw flashes. There was a project to try UBIFS on block devices. Using it "as-is" will of course suck, because UBI cannot really utilize the asynchronous I/O of the block layer. This is fixable though, but needs some work. I think the benchmark results would be a lot better in that case.
Posted Dec 12, 2012 21:08 UTC (Wed) by arnd (subscriber, #8866)
Posted Dec 13, 2012 18:39 UTC (Thu) by yoush (guest, #38940)
UBI's guaranteed wear-leveling is effectively turned into a random thing.
FTLs are known to start misbehaving badly after some time if they don't get information about which blocks are free. So unless TRIM commands are used, "ubi over block2mtd over FTLed flash" will badly degrade over time.
Getting raw access to flash devices (bypassing vendor FTLs) is indeed the best possible scenario, because this gives way to open development of reliable and high-performance algorithms to manage those.
Posted Dec 15, 2012 23:26 UTC (Sat) by marcH (subscriber, #57642)
With current flash technologies and their rate of progress this would practically require writing one different driver per flash chip.
Maybe what's required is something in the middle: some kind of new, more evolved standard interface, something block+page based?
Anyone having worked on MTD should know.
Posted Dec 13, 2012 18:59 UTC (Thu) by wookey (guest, #5501)
The tale of its development, the mainlining attempt and why ultimately it failed, and its continued existence in its little niche, making a living for a couple of people, is interesting in itself.
Ultimately the problem was that the kernel people wouldn't take anything less than a rewrite to exclusively use standard kernel features, but the author, who still needed to support it on other OSes, wasn't prepared to remove the compatibility features that made that work. Nearly everything could be munged to satisfy both sides, but a few things were sticking points. It didn't seem to be possible to reach agreement without forking the codebases and no-one really wanted to do that.
Posted Dec 14, 2012 17:13 UTC (Fri) by plougher (guest, #21620)
Dare I say it, but the clear implication is that Yaffs makes money, but only on the other OSes, and the insistence by the "kernel people" on losing the other OS support was forcing a choice between mainlining it and not continuing to make a living, or making a living by keeping it out of mainline.
Strangely, I cannot recall that point being made in the mainlining process. Was it made and I missed it, or was it felt inappropriate and unlikely to further the mainlining process? Either way, I now think the author made the "right choice" in not continuing to mainline it.
I like YAFFS and it was the first workable flash filesystem I used in ~2002, back at a time when JFFS2 was worse than useless. The fact that YAFFS tends to get written out of Linux kernel history, and the abortive mainlining process, doesn't tend to show the kernel community in much glory.
Posted Dec 14, 2012 18:25 UTC (Fri) by wookey (guest, #5501)
Yes this wasn't really discussed as part of the mainlining - as it's not technical and thus not really very relevant. The point was made that the FS wasn't only used with Linux and keeping it working for the others (from a single, or at least very similar, codebase) was important.
It's not quite as black and white as a "a choice between mainlining it and not continuing to make a living", it was just that there was a point beyond which the advantages of mainlining (saving maintenance effort, wider exposure, making life a bit easier for linux users) were not sufficient to justify the disadvantages (extra maintenance effort due to divergence) from the author's POV. He decided he'd given it his best shot and been rebuffed. He's not a pushy guy.
It seems no-one else has cared enough to try again in the couple of years since then, probably at least partly because they don't feel they have the moral right to do that.
Posted Jan 11, 2013 11:38 UTC (Fri) by oak (guest, #2786)
The compression algorithm has a large effect on the speed. For example, LZO is very fast at uncompressing and compression is also fast, although that depends on what LZO compression level is used.
ZLIB compression is significantly slower for both, but provides better compression results.
Which one to choose depends on:
* What kind of data is being stored
* Flash write and read speeds
* Algorithm compression and uncompression speed compared to those (which depends on CPU speed)
The article mentions using a high-speed PC, so for it the algorithm with the best compression ratio would probably be best. On Nokia's Maemo devices a low LZO compression level was used at run time, but pre-made root filesystems were compressed using the highest LZO compression level (this explains why apt-get upgrade could leave less disk space even though the binaries were the same size).
As to data types, non-compressible data like music, videos, etc. is typically user data and it can even be on a separate partition or storage medium (on Maemo devices, the SD card) with a different filesystem than the root filesystem, which contains the binaries, logs, etc.
Posted Jan 17, 2013 11:21 UTC (Thu) by meuh (guest, #22042)

> for a PC-style file attribute table

erk !
This is Disk Operating System (DOS) filesystem style if you want, but not tied to a particular hardware.
Posted Feb 26, 2013 6:02 UTC (Tue) by vapier (guest, #15768)

missed JFFS2's Erase Block Summary (EBS) feature ?
http://www.linux-mtd.infradead.org/doc/jffs2.html

some actual performance numbers shows this can easily be a 6x speed increase:
http://docs.blackfin.uclinux.org/doku.php?id=linux-kernel...

covering yaffs would also have been cool :).

i wonder if the changes you made to block2mtd result in numbers that are really comparable. by faking out the erase steps (which in a real flash is not free -- erasing tends to be the slowest operation), ubifs is no longer resilient to power losses right ? unlike the others which would be able to recover. so you've given a nice speed increase to ubifs w/out any such grant to the others. or am i missing something obvious ?