LogFS

[Posted May 15, 2007 by corbet]

Rotating magnetic storage technology has served us well for a long time. It offers high capacities (for an ever-increasing value of "high"), relatively fast and relatively uniform access times, and relatively good reliability. It is generally accepted that rotating disks will be part of our systems for some time yet. For smaller sizes, however, disks are increasingly being pushed aside by solid state flash memory - and "smaller" is an ever-increasing value as well. Flash is more compact, requires less power, and offers truly random access, so it's not surprising to see it being deployed in more situations.

Flash is not without its drawbacks. Its relatively high cost limits its applications and it brings its own set of quirks which must be understood and addressed by filesystem developers. Even so, some special-purpose laptops rely on flash for their persistent storage needs now, and there are rumors of more flash-based systems in the near future.

The most significant of the "quirks" mentioned above are:

Flash storage cannot be simply overwritten like magnetic storage; instead, a flash block must be explicitly erased and rewritten in two separate steps. The size of the "erase blocks" may not match the block size as understood by the operating system; often, the erase blocks are relatively large.
There are limits to the number of times a block of flash memory can be erased and rewritten before it loses the ability to reliably store data. That limit is generally around 100,000 cycles.

These hardware features have some interesting implications. What, for example, happens when the operating system decides to rewrite a single block within a larger flash erase block? A naive implementation would read the entire erase block, perform the erase operation, then write the data back with the new portion included. Should the system go down in the middle of this operation, however, all of the data within the erase block may be lost forever. If the operating system ignores the block lifetime issues, it is likely to cycle some erase blocks much more frequently than others, significantly shortening the overall life of the device. When one is dealing with a low-duty-cycle device, such as a USB thumb drive, it's possible to get away with ignoring the limitations that flash has. When a flash drive is the primary storage device, though, a smarter approach is called for.

Being smarter is usually a matter of using a filesystem which was explicitly designed to work well with flash hardware. These filesystems can dispense with the great care that other filesystems must take in how blocks are laid out - there are no seek time or rotational latency issues with flash drives. On the other hand, flash-aware filesystems must be written with erase cycles in mind; they must not risk losing data during these cycles and they should endeavor to spread these cycles across the drive to maximize its lifetime.

The end result is that filesystems designed for flash devices take the log-structured approach. The device is treated like a sort of circular buffer, with new data always being written to the end. This approach makes for fast write operations, but the read side can be a more complex story. One approach taken is to attach some metadata to each erase block describing which file that block belongs to and its version number. When an erase block is to be rewritten, a new copy is made at the end with a higher version number; reading the file is simply a matter of finding the erase block with the highest version number.

Finding that block requires scanning the disk - something which, most likely, one does not want to do for every read operation. The in-kernel JFFS2 filesystem solves this problem by performing a scan when the filesystem is mounted. It builds an in-memory data structure which speeds subsequent accesses considerably. There is a cost, though: the initial scan can make mounting slow, and the in-memory tree can take a considerable amount of space. Given that flash filesystems are often used on small, embedded systems - where both boot time and memory are at a premium - these costs are significant.

Jörn Engel thinks he has a better way in the form of the LogFS filesystem, currently proposed for inclusion into the mainline. The core idea behind LogFS is that, rather than building the filesystem tree at mount time, the filesystem code should store the tree on the device itself, much like traditional filesystems do. Putting the tree on the flash device reduces mount times (Jörn says that an OLPC system goes from 3.3 seconds under JFFS2 to 60ms under LogFS) and should reduce the runtime memory requirements considerably.

The on-flash tree looks much like the structure used by ext2. There are some differences in how it is managed, however. The log structure of the filesystem implies that blocks cannot be rewritten in place; any time a block is changed it must be moved and written to a new location. If there are pointers to the moved block (think about the usual indirect blocks used to store the layout of larger files), the blocks containing the pointers must also be changed, and thus moved. That, in turn, will require changes at the next level up in the tree. Thus changes at the bottom of the tree will propagate upward all the way to the root. This is the "wandering tree" algorithm. One of the advantages is that the old filesystem structure remains valid until the root is rewritten - a crash could cause the loss of the last operation, but it will leave previous data and the structure of the filesystem intact.

Actually managing the entire directory tree as a wandering tree would be expensive; beyond that, files with multiple hard links break the tree structure and make wandering trees much harder to implement. So the actual tree implemented by LogFS just has two levels. There is an "inode file" containing the inode structures for every file and directory existing within the filesystem; each inode then points to the associated blocks holding the file's data. Directory entries contain a simple integer index giving the inode offset within the inode file. So changes to an inode only require writing the inode itself and the inode file; the rest of the directory structure need not be touched.

To tie it all together, LogFS sets aside a group of blocks as the "anchor area," where versioned pointers to the root inode can be found. Mounting the filesystem requires scanning this anchor area to find the current version of the root inode, at which point the rest of the filesystem becomes accessible. This mechanism allows the root to be found in constant time without the need to scan the entire device.

LogFS has been through a couple rounds of review, with significant changes each time. Barring significant problems, it should be getting close to ready, perhaps it will be merged in time for 2.6.23.

(See also: Jörn's LogFS paper from which much of the above was cribbed).

Index entries for this article
Kernel	Filesystems/Flash
Kernel	LogFS

FTL and obsolete sectors

Posted May 18, 2007 20:32 UTC (Fri) by seanyoung (subscriber, #28711) [Link]

The logfs paper mentions obsoleting data in an ftl. I wrote a patch for that some time back but it didn't work all that well.

http://lwn.net/Articles/162776/

I introduced a new barrier request "forget" (obsolete would have been a better name). The ftl block device can used this information; however the barrier request caused more writes than without, as write requests before and after the barrier cannot be merged.

For example, the fat table itself is repeatedly written. If:

- write fat
- forget barrier request
- write fat again

Without the forget barrier the first write can be merged with second. So the only way to do this properly is by adding functionality to the block layer which can merge forget/obsolete requests. So the patch as describes actually increases I/O.

Another problem was that the in-kernel FTL layers are not used much nowadays, so the benefit would be limited. OTOH, the CompactFlash ATA command set does have an "erase" command which could possibly do exactly this -- I never verified this. However if it does compact flash memory would be faster and last longer.

LogFS

Posted May 24, 2007 11:03 UTC (Thu) by Duncan (guest, #6647) [Link]

So we have a "wandering journal", much like reiser4 is supposed to have.
What would be the possibility of adding a plugin to reiser4 to make it
behave with flash requirements in mind as opposed to rotating disc
requirements?

Of course, reiser4 isn't in mainline yet, but supposedly, it's getting
closer, and with Namesys employees continuing to work toward that, and
with Hans himself (fs genius but political sand in the gears he often
seemed to be) out of the equation now, merging for 2.6.23 or .24 is
beginning to look both politically and technically possible. (See this
article from the April 26 LWN kernel page, "Filesystems: chunkfs and
reiser4", http://lwn.net/Articles/231585/ .)

Duncan

LogFS

Posted Jul 9, 2007 2:20 UTC (Mon) by crucialp (guest, #46155) [Link]

This is a very interesting idea, and great to see how it could be included in mainline.