
Kernel development

Brief items

Kernel release status

The 3.7 kernel is out, released by Linus on December 10. "Anyway, it's been a somewhat drawn out release despite the 3.7 merge window having otherwise appeared pretty straightforward, and none of the rc's were all that big either. But we're done, and this means that the merge window will close on Christmas eve." Of course, "drawn out" is a relative term; at 72 days, this cycle is only a few days above average in length. Headline features in this kernel include 64-bit ARM support, improved security with supervisor-mode access prevention, SMB 2.1 support, server-side TCP fast open support, signed kernel modules, and more. See the KernelNewbies 3.7 page for details.

Stable updates: 3.0.55 and 3.4.22, containing a build error fix, were released on December 5. The rather larger 3.2.35 update was released on December 7. 3.0.56, 3.4.23 and 3.6.10 were released on December 10. No stable updates are in the review process as of this writing.

Comments (none posted)

Quotes of the week

Modern network hardware has often sprouted various "offload" engines, unfortunately now often enabled by default, which tend to do more damage than good except for extreme benchmarking fanatics, often primarily on big server machines in data centers. Start by turning them off. We'll write more on this topic soon. The implementors of this "smart" hardware are less "smart" than they think they are.
The Bufferbloat project on CoDel benchmarking best practices

When we enter the kernel mode, we start with saving CPU state. Usually (and you are going to hate that word well before you read to the end) it's stored in struct pt_regs, but it might be more complicated. For our purposes it's better to think of it as abstract saved state, leaving aside the question of how it's represented.
Al Viro teaches a class on signal handling

Comments (3 posted)

Quotes of the week: FALLOC_FL_NO_HIDE_STALE edition

    If (maintainer thinks their patch is right) {
	patch doesn't need review
    } else {
	/* maintainer thinks the patch is wrong. */
	/* XXX: why would you think your own patch is wrong? */
	patch needs review
    }
Dave Chinner

Review is part of the way we work as a community and we should figure out how to fix our review process so that we can have meaningful results from the review or we lose confidence in the process and it makes it much harder to get reviewers to spend time reviewing when their reviews are ultimately ignored.
Ric Wheeler

Anybody who claims that our "process" requires that things like that go on the mailing list and pass long reviews and discussions IS JUST LYING.

Because it's not true. We discuss big features, and we want code review, yes, but the fact is, most small obvious patches do *not* get reviewed, they just get fixed. You all know that, why the hell are you suddenly claiming that this is so magically different?

Linus Torvalds

This is why this discussion reminds me so much of the wakelocks discussion, and why I've made the same decision the Android folks made, except they wasted far more time and got far more frustrated --- I'll just keep the damned thing as a out-of-tree patch, until there are enough other people willing to say that they need and are using this patch because their workloads and use cases need it. It will save me a whole lot of time.
Ted Ts'o

Comments (none posted)

Kernel development news

3.8 Merge window part 1

By Jonathan Corbet
December 12, 2012
The 3.8 merge window looks to be an interesting time. In theory, it closes just before the Christmas holiday, though Linus has threatened to start his celebrations early. Despite the possibly shortened window, there are, according to linux-next maintainer Stephen Rothwell, "more commits in linux-next than ever before." So expect to see a lot of changes flowing into the mainline in a relatively short period of time.

As of this writing, some 3800 of those changes have been merged by Linus. The most significant user-visible changes include:

  • The cpuidle subsystem is now able to associate different drivers with each CPU. This capability is needed to support asymmetric architectures like big.LITTLE.

  • Linux running as a Microsoft Hyper-V guest now supports memory-use reduction via the Hyper-V balloon driver.

  • Applications accessing huge pages via the mmap() or SYSV IPC interfaces can now specify which page size they want.

  • The x86 architecture code can, finally, support hotplugging of the initial boot CPU ("CPU0").

  • On the other hand, as discussed at the 2012 Kernel Summit, support for the venerable 386 architecture has been removed from the kernel. H. Peter Anvin informed Linus of an important loss of functionality from this change: "Unfortunately there's a nostalgic cost: your old original 386 DX33 system from early 1991 won't be able to boot modern Linux kernels anymore." Linus was unmoved, though, and merged the change.

  • The XFS filesystem has gained a new verification mechanism that can detect corrupted data read from the storage device.

  • New hardware support includes:

    • Processors and systems: Broadcom BCM281XX SoCs, Allwinner A1X SoCs, USI Topkick boards, ZyXEL NSA-310 boards, MPL CEC4 boards, and Samsung EXYNOS5440 SoCs. Support for SH-Mobile SH7367 and SH7377 CPUs has been removed. Also removed is support for the Intel PXA2xx/PXA3xx Marvell PXA95x architecture on the assumption that nobody will miss it; anybody who disagrees may want to do so in the near future.

    • Memory-technology devices: Wondermedia SD/MMC host controllers, and Realtek PCI-E SD/MMC and Memstick card interfaces.

    • Miscellaneous: Texas Instruments ADS7830 analog-to-digital converters (ADCs), Texas Instruments ADC081C021/027 ADCs, Dialog Semiconductor DA9055 ADCs, Analog Devices AD54xx digital-to-analog converters, STMicroelectronics SPEAr PLGPIO controllers, Dialog Semiconductor DA9055 GPIO controllers, Cirrus Logic CLPS711x/EP721x/EP731x-based GPIO controllers, Technologic Systems TS-5500 digital I/O controllers, Exar XR17V35x multi-port PCIe UARTs, ARC (Synopsys) UARTs, SystemBase PCI Multiport UARTs, Commtech Fastcom Async-335 and Fastcom Async-PCIe cards, ACPI enumerated SDHCI controllers, Firewire-attached TTY devices, Analog Devices ADIS16136 gyroscopes, and Analog Devices ADIS16480 inertial measurement units.

    • Thermal: The kernel has a new thermal governor subsystem capable of responding when the system gets too hot. A driver has been added for ST-Ericsson DB8500 thermal regulators.

    • USB: Renesas R-Car USB phys.

    • Staging graduations: the IndustryPack bus driver, Maxim max1363 ADC driver, Analog Devices AD7793 ADC driver, and Analog Devices AD7298 ADC driver have moved out of the staging tree. The RealTek PCI-E card reader driver has been removed from staging since that functionality is now provided by a separate mainline driver.

Changes visible to kernel developers include:

  • The __devinit, __devinitdata, __devinitconst, __devexit, and __devexit_p() macros are on their way out; many drivers have been fixed to stop using them. In the future, the CONFIG_HOTPLUG option will no longer exist, so initialization and finalization code will need to be kept around in all kernels.

  • The power management quality-of-service subsystem can now support device-specific QOS flags. Two flags have been defined in 3.8: PM_QOS_FLAG_NO_POWER_OFF and PM_QOS_FLAG_REMOTE_WAKEUP.

  • The devfreq subsystem now supports devices that can be suspended (or placed into an idle state) independently of the rest of the system.

  • The UIO driver subsystem has a new generic platform driver allowing UIO devices to access memory allocated by CMA or the IOMMU subsystem.

  • The per-entity load-tracking patch set has been merged; this code allows the scheduler to better understand which processes (and control groups) are putting load on the system, thus improving load balancing and related decisions.

  • The callback-free RCU implementation has been merged, allowing the offloading of some read-copy-update overhead from a subset of CPUs in the system.

The 3.8 merge window has just begun; there are a lot of subsystem trees yet to be pulled into the mainline. LWN will continue to follow the kernel repository as Linus pulls in more patches and establishes the feature set for the 3.8 release; stay tuned.

Comments (9 posted)

User namespaces progress

By Michael Kerrisk
December 13, 2012

The results of the user namespaces work on Linux have been a long time in coming, probably because they are the most complex of the various namespaces that have been added to the kernel so far. The first pieces of the implementation started appearing when Linux 2.6.23 (released in late 2007) added the CLONE_NEWUSER flag for the clone() and unshare() system calls. However, until now, many of the pieces necessary for a complete implementation have remained absent.

We last looked at user namespaces back in April, when Eric Biederman was working to push a raft of patches into the kernel with the goal of bringing the implementation closer to completion. Eric is now engaged in pushing further patches into the kernel with the goal of having a more or less complete implementation of user namespaces in Linux 3.8. Thus, it seems to be time to have another look at this work. First, however, a brief recap of user namespaces is probably in order.

User namespaces allow per-namespace mappings of user and group IDs. In the context of containers, this means that users and groups may have privileges for certain operations inside the container without having those privileges outside the container. (In other words, a process's set of capabilities for operations inside a user namespace can be quite different from its set of capabilities in the host system.) One of the specific goals of user namespaces is to allow a process to have root privileges for operations inside the container, while at the same time being a normal unprivileged process on the wider system hosting the container.

To support this behavior, each of a process's user IDs has, in effect, two values: one inside the container and another outside the container. Similar remarks hold true for group IDs. This duality is accomplished by maintaining a per-user-namespace mapping of user IDs: each user namespace has a table that maps user IDs on the host system to corresponding user IDs in the namespace. This mapping is set and viewed by writing and reading the /proc/PID/uid_map pseudo-file, where PID is the process ID of one of the processes in the user namespace. Thus, for example, user ID 1000 on the host system might be mapped to user ID 0 inside a namespace; a process with a user ID of 1000 would thus be a normal user on the host system, but would have root privileges inside the namespace. If no mapping is provided for a particular user ID on the host system, then, within the namespace, the user ID is mapped to the value provided in the file /proc/sys/kernel/overflowuid (the default value in this file is 65534). Our earlier article went into more details of the implementation.

One further point worth noting is that the description given in the previous paragraph looks at things from the perspective of a single user namespace. However, user namespaces can be nested, with user and group ID mappings applied at each nesting level. This means that a process might have distinct user and group IDs in each of the nested user namespaces in which it is a member.

Eric has assembled a number of namespace-related patch sets for submission in the upcoming 3.8 merge window. Chief among these is the set that completes the main pieces of the user namespace infrastructure. With the changes in this set, unprivileged processes can now create new user namespaces (using clone(CLONE_NEWUSER)). This is safe, says Eric, because:

Now that we have been through every permission check in the kernel having uid == 0 and gid == 0 in your local user namespace no longer adds any special privileges. Even having a full set of caps in your local user namespace is safe because capabilities are relative to your local user namespace, and do not confer unexpected privileges.

The point that Eric is making here is that following the work (described in our earlier article) to implement the kuid_t and kgid_t types within the kernel, and the conversion of various calls to capable() to its namespace analog, ns_capable(), having a user ID of zero inside a user namespace no longer grants special privileges outside the namespace. (capable() is the kernel function that checks whether a process has a capability; ns_capable() checks whether a process has a capability inside a namespace.)

The creator of a new user namespace starts off with a full set of permitted and effective capabilities within the namespace, regardless of its user ID or capabilities on the host system. The creating process thus has root privileges for the purpose of setting up the environment inside the namespace in preparation for the creation or addition of other processes inside the namespace. Among other things, this means that the (unprivileged) creator of the user namespace (or indeed any process with suitable capabilities inside the namespace) can in turn create all other types of namespaces, such as network, mount, and PID namespaces (those operations require the CAP_SYS_ADMIN capability). Because the effect of creating those namespaces is limited to the members of the user namespace, no damage can be done to the host system.

Other notable user-space changes in Eric's patches include extending the unshare() system call so that it can be employed to create user namespaces, and extensions that allow a process to use the setns() system call to enter an existing user namespace.

Looking at some of the other patches in the series gives an idea of just how subtle some of the details are that must be dealt with in order to create a workable implementation of user namespaces. For example, one of the patches deals with the behavior of set-user-ID (and set-group-ID) programs. When a set-user-ID program is executed (via the execve() system call), the effective user ID of the process is changed to match the user ID of the executable file. When a process inside a user namespace executes a set-user-ID program, the effect is to change the process's effective user ID inside the namespace to whatever value was mapped for the file user ID. Returning to the example used above, where user ID 1000 on the host system is mapped to user ID 0 inside the namespace, if a process inside the user namespace executes a set-user-ID program owned by user ID 1000, then the process will assume an effective user ID of 0 (inside the namespace).

However, what should be done if the file user ID has no mapping inside the namespace? One possibility would be for the execve() call to fail. However, Eric's patch implements another approach: the set-user-ID bit is ignored in this case, so that the new program is executed, but the process's effective user ID is left unchanged. Eric's reasoning is that this mirrors the semantics of executing a set-user-ID program that resides on a filesystem that was mounted with the MS_NOSUID flag. Those semantics have been in place since Linux 2.4, so the kernel code paths for this behavior should be well tested.

Another notable piece of work in Eric's patch set concerns the files in the /proc/PID/ns directory. This directory contains one file for each type of namespace of which the process is a member (thus, for each process, there are the files ipc, mnt, net, pid, user, and uts). These files already serve a couple of purposes. Passing an open file descriptor for one of these files to setns() allows a process to join an existing namespace. Holding an open file descriptor for one of these files, or bind mounting one of the files to some other location in the filesystem, will keep a namespace alive even if all current members of the namespace terminate. Among other things, the latter feature allows the piecemeal construction of the contents of a container. With this patch in Eric's recent series, a single /proc inode is now created per namespace, and the /proc/PID/ns files are instead implemented as special symbolic links that refer to that inode. The practical upshot is that if two processes are in, say, the same user namespace, then calling stat() on the respective /proc/PID/ns/user files will return the same inode numbers (in the st_ino field of the returned stat structure). This provides a mechanism for discovering if two processes are in the same namespace, a long-requested feature.

This article has covered just the patch set to complete the user namespace implementation. However, at the same time, Eric is pushing a number of related patch sets towards the mainline, including: changes to the networking stack so that user namespace root users can create network namespaces; enhancements and clean-ups of the PID namespace code that, among other things, add unshare() and setns() support for PID namespaces; enhancements to the mount namespace code that allow user namespace root users to call chroot() and to create and manipulate mount namespaces; and a series of patches that add support for user namespaces to a number of file systems that do not yet provide that support.

It's worth emphasizing one of the points that Eric noted in a documentation patch for the user namespace work, and elaborated on in a private mail. Beyond the practicalities of supporting containers, there is another significant driving force behind the user namespaces work: to free the UNIX/Linux API of the "handcuffs" imposed by set-user-ID and set-group-ID programs. Many of the user-space APIs provided by the kernel are root-only simply to prevent the possibility of accidentally or maliciously distorting the run-time environment of privileged programs, with the effect that those programs are confused into doing something that they were not designed to do. By limiting the effect of root privileges to a user namespace, and allowing unprivileged users to create user namespaces, it now becomes possible to give non-root programs access to interesting functionality that was formerly limited to the root user.

There have been a few Acked-by: mails sent in response to Eric's patches, and a few small questions, but the patches have otherwise passed largely without comment, and no one has raised objections. It seems likely that this is because the patches have been around in one form or another for a considerable period, and Eric has gone to considerable effort to address objections that were raised earlier during the user namespaces work. Thus, it seems that there's a good chance that Eric's pull request to have the patches merged in the currently open 3.8 merge window will be successful, and that a complete implementation of user namespaces is now very close to reality.

Comments (19 posted)

JFFS2, UBIFS, and the growth of flash storage

December 11, 2012

This article was contributed by Neil Brown

When thinking about filesystems for modern flash storage devices, as we have recently done with f2fs and NILFS2, two other filesystems that are likely to quickly spring to mind, and be almost as quickly discarded, are JFFS2 and UBIFS. They spring to mind because they were designed specifically to work with flash, and are discarded because they require access to "raw flash" whereas the flash devices we have been considering have a "flash translation layer" (FTL) which hides some of the details of the flash device and which needs to be accessed much like a disk drive.

This quick discarding may well not be appropriate — these are open-source filesystems after all and are thus free to be tinkered with. If the Apollo 13 technicians were able to link the lithium hydroxide canisters from the command module to the CO₂ scrubber in the lunar module, it shouldn't be too hard for us to link a raw-flash filesystem to an FTL-based storage chip — if it seemed like a useful thing to do.

Raw access to a flash device goes through the "mtd" (Memory Technology Devices) interface in Linux and, while this is a rich interface, the vast majority of accesses from a filesystem are via three functions: mtd_read(), mtd_write() and mtd_erase(). The first two are easily implemented by a block device — though you need to allow for the fact that the mtd interface is synchronous while the block layer interface is asynchronous — and the last can be largely ignored as an FTL handles erasure internally. In fact, Linux provides a "block2mtd" device which will present an arbitrary block device as an mtd device. Using this might not be the most efficient way to run a filesystem on new hardware, but it would at least work as a proof-of-concept.

So it seems that there could be some possibility of using one of these filesystems, possibly with a little modification, on an FTL-based flash device, and there could certainly be value in understanding them a little better as, at the very least, they could have lessons to teach us.

A common baseline

Despite their separate code bases, there is a lot of similarity between JFFS2 and UBIFS — enough that it seems likely that the latter was developed in part to overcome the shortcomings of the former. One similarity is that, unlike the other filesystems we have looked at, neither of these filesystems has a strong concept of a "basic block size". The concept is there if you look for it, but it isn't prominent.

One of the main uses of a block size in a filesystem is to manage free space. Some blocks are in use, others are free. If a block is only partially used — for example if it contains the last little bit of a file — then the whole block is considered to be in use. For flash filesystems, blocks are not as useful for free-space management as this space is managed in terms of "erase blocks," which are much larger than the basic blocks of other filesystems, possibly as large as a few megabytes. Another use of blocks in a filesystem is as a unit of metadata management. For example NILFS2 manages the ifile (inode file) as a sequence of blocks (rather than just a sequence of inodes), while F2FS manages each directory as a set of hash tables, each of which contains a fixed number of blocks.

JFFS2 and UBIFS don't take this approach at all. All data is written consecutively to one or more erase blocks with some padding to align things to four-byte boundaries, but with no alignment so large that it could be called a block. When indexing of data is needed, an erase-block number combined with a byte offset meets the need, so the lack of alignment does not cause an issue there.

Both filesystems further make use of this freedom in space allocation by compressing the data before it is written. Various compression schemes are available including LZO and ZLIB together with some simpler schemes like run-length encoding. Which scheme is chosen depends on the desired trade off between space saving and execution time. This compression can make a small flash device hold nearly twice as much as you might expect, depending on the compressibility of the files of course. Your author still recalls the pleasant surprise he got when he found out how much data would fit on the JFFS2 formatted 256MB flash in the original Openmoko Freerunner: a reasonably complete Debian root filesystem with assorted development tools and basic applications still left room for a modest amount of music and some OSM map tiles.

In each case, the data and metadata of the filesystem are collected into "nodes" which are concatenated and written out to a fresh erase block. Each node records the type of data (inode, file, directory name, etc), the address of the data (such as inode number), the type of compression and a few other details. This makes it possible to identify the contents of the flash when mounting and when cleaning, and effectively replaces the "segment summary" that is found in f2fs and NILFS2.

Special note should be made of the directory name nodes. While the other filesystems we have studied store a directory much like a file, with filenames stored at various locations in that file, these two filesystems do not. Each entry in the directory is stored in its own node, and these nodes do not correspond to any particular location in a "file" — they are simply unique entries. JFFS2 and UBIFS each have their own particular way of finding these names as we shall see, but in neither case is the concept of a file offset part of that.

The one place where a block size is still visible in these filesystems is in the way they chop a file up into nodes for storage. In JFFS2, a node can be of any size up to 4KB so a log file could, for example, be split up as one node per line. However the current implementation always writes whole pages — to quote the in-line commentary, "It sucks, but it's simple". For UBIFS, data nodes must start at a 4KB-aligned offset in the file so they are typically 4KB in size (before compression) except when at the end of the file.

JFFS2 — the journaling flash filesystem

A traditional journaling filesystem, such as ext3 or xfs, adds a journal to a regular filesystem. Updates are written first to the journal and then to the main filesystem. When mounting the filesystem after a shutdown, the journal is scanned and anything that is found is merged into the main filesystem, thus providing crash tolerance. JFFS2 takes a similar approach with one important difference — there is no "regular filesystem". With JFFS2 there is only a journal, a journal that potentially covers the entire device.

It is probably a little misleading to describe JFFS2 as "just one journal". This is because it might lead you to think that when it gets to the end of the journal it just starts again at the beginning. While this was true of JFFS1, it is not for JFFS2. Rather it might be clearer to think of each erase block as a little journal. When one erase block is full, JFFS2 looks around for another one to use. Meanwhile if it notices that some erase blocks are nearly empty it will move all the active nodes out of them into a clean erase block, and then erase and re-use those newly-cleaned erase blocks.

When a JFFS2 filesystem is mounted, all of these journals, and thus the entire device, are scanned and every node found is incorporated into an in-memory data structure describing the filesystem. Some nodes might invalidate other nodes; this may happen when a file is created and then removed: there will be a node recording the new filename as belonging to some directory, and then another node recording that the filename has been deleted. JFFS2 resolves all these modifications and ends up with a data structure that describes the filesystem as it was the last time something was written to it, and also describes where the free space is. The structure is kept as compact as possible and naturally does not contain any file data; instead, it holds only the addresses where the data should be found and so, while it will be much smaller than the whole filesystem, it will still grow linearly as the filesystem grows.

This need to scan the entire device at mount time and store the skeleton of the filesystem in memory puts a limit on the size of filesystem that JFFS2 is usable for. Some tens of megabytes, or even a few hundred megabytes, is quite practical. Once the device gets close to, or exceeds, a gigabyte, JFFS2 becomes quite impractical. Even if memory for storing the tree were cheap, time to mount the filesystem is not.

This is where UBIFS comes in. While the details are quite different, UBIFS is a lot like JFFS2 with two additions: a tree to index all the nodes in the filesystem, and another tree to keep track of free space. With these two trees, UBIFS avoids both the need to scan the entire device at mount time and the need to keep a skeleton of the filesystem in memory at all times. This allows UBIFS to scale to much larger filesystems — certainly many tens of gigabytes and probably more.

But before we look too closely at these trees it will serve us well to look at some of the other details and in particular at "UBI", a layer between the MTD flash interface layer and UBIFS. UBI uses an unsorted collection of flash erase blocks to present a number of file system images; UBI stands for Unsorted Block Images.

UBI — almost a Flash Translation Layer

The documentation for UBI explicitly states that it is not a flash translation layer. Nonetheless it shares a lot of functionality with an FTL, particularly wear leveling and error management. If you imagined UBI as an FTL where the block size was the same as the size of an erase block, you wouldn't go far wrong.

UBI uses a flash device which contains a large number of Physical Erase Blocks (PEBs) to provide one or more virtual devices (or "volumes") which each consist of a smaller number of Logical Erase Blocks (LEBs), each slightly smaller than a PEB. It maintains a mapping from LEB to PEB and this mapping may change from time to time due to various causes including:

  • Writing to an LEB. When an LEB is written, the data will be written to a new, empty, PEB and the mapping from LEB to PEB will be updated. UBI is then free to erase the old PEB at its leisure. Normally, the first new write to an LEB will make all the data previously there inaccessible. However, a feature is available where the new PEB isn't committed until the write request completes. This ensures that after a sudden power outage, the LEB will either have the old data or the complete new data, never anything else.

  • Wear leveling. UBI keeps a header at the start of each PEB which is rewritten immediately after the block is erased. One detail in the header is how many times the PEB has been written and erased. When UBI notices that the difference between the highest write count and the lowest write count in all the PEBs gets too high (based on a compile-time configuration parameter: MTD_UBI_WL_THRESHOLD), it will move an LEB stored in a PEB with a low write count (which is assumed to be stable since the PEB containing it has not been rewritten often) to one with a high write count. If this data continues to be as stable as it has been, this will tend to reduce the variation among write counts and achieve wear leveling.

  • Scrubbing. NAND flash includes an error-correcting code (ECC) for each page (or sub-page) which can detect multiple-bit errors and correct single-bit errors. When an error is reported while reading from a PEB, UBI will relocate the LEB in that PEB to another PEB so as to guard against a second bit error, which would be uncorrectable. This process happens transparently and is referred to as "scrubbing".

The functionality described above is already an advance on the flash support that JFFS2 provides. JFFS2 does some wear leveling, but it is not precise. It keeps no record of write counts; instead, it decides to relocate an erase block based on the roll of a die (or, actually, the sampling of a random number). This probably provides some leveling of wear, but there are no guarantees. JFFS2 also has no provision for scrubbing.

The mapping from PEB to LEB is stored spread out over all active erase blocks in the flash device. After the PEB header that records the write count there is a second header which records the volume identifier and LEB number of the data stored here. To recover this mapping at mount time, UBI needs to read the first page or two from every PEB. While this isn't as slow as reading every byte like JFFS2 has to, it would still cause mount time to scale linearly with device size — or nearly linearly as larger devices are likely to have larger erase block sizes.

Recently this situation has improved. A new feature known as "fastmap" made its way into the UBI driver for Linux 3.7. Fastmap stores a recent copy of the mapping in some erase block together with a list of the several (up to 256) erase blocks which will be written next, known as the pool. The mount process then needs to examine the first 64 PEBs to find a "super block" which points to the mapping, read the mapping, and then read the first page of each PEB in the pool to find changes to the mapping. When the pool is close to exhaustion, a new copy of the mapping with a new list of pool PEBs is written out. This is clearly a little more complex, but puts a firm cap on the mount time and so ensures scalability to much larger devices.

UBIFS — the trees

With UBIFS, all the filesystem content — inodes, data, and directory entries — is stored in nodes in various arbitrary Logical Erase Blocks, and the addresses of these blocks are stored in a single B-tree. This is similar in some ways to reiserfs (originally known as "treefs") and Btrfs, and contrasts with filesystems like f2fs, NILFS2 and ext3 where inodes, file data, and directory entries are all stored with quite different indexing structures.

The key for lookup in this B-tree is 64 bits wide, formed from a 32-bit inode number, a three-bit node type, and a 29-bit offset (for file data) or hash value (for directory entries). This last field, combined with a 4KB block size used for indexing, limits the size of the largest file to two terabytes, probably the smallest limit in the filesystem.
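The key layout, and the arithmetic behind the two-terabyte limit, can be made concrete with a short sketch. The ordering of the fields within the 64 bits is an assumption here; the field widths come from the description above:

```python
# Pack a 64-bit UBIFS-style B-tree key: 32-bit inode number, 3-bit node
# type, 29-bit block offset (or directory-entry hash). Field order within
# the word is assumed for illustration.
def make_key(inode, node_type, offset):
    assert inode < 2**32 and node_type < 2**3 and offset < 2**29
    return (inode << 32) | (node_type << 29) | offset

# With 29 bits of block offset and 4KB indexing blocks, the largest
# addressable file is 2^29 * 4096 bytes = two terabytes.
assert 2**29 * 4096 == 2 * 2**40

key = make_key(42, 1, 1000)
assert key >> 32 == 42                  # inode number in the top 32 bits
assert (key >> 29) & 0x7 == 1           # node type in the next 3 bits
assert key & ((1 << 29) - 1) == 1000    # offset in the low 29 bits
```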

Nodes in this B-tree are, like other nodes, stored in whichever erase block happens to be convenient. They are also like other nodes in that they are not sized to align with any "basic block" size. Rather the size is chosen based on the fan-out ratio configured for the filesystem. The default fan-out is eight, meaning that each B-tree node contains eight keys and eight pointers to other nodes, resulting in a little under 200 bytes per node.

Using small nodes means that fewer bytes need to be written when updating indexes. On the other hand, there are more levels in the tree, so more reading is likely to be required to find a node. The ideal trade-off will depend on the relative speeds of reads and writes. For flash storage that serves reads a lot faster than writes — which is not uncommon, but seemingly not universal — it is likely that this fan-out provides a good balance. If not, it is easy to choose a different fan-out when creating a filesystem.
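The trade-off can be quantified with a little back-of-envelope arithmetic. The key, pointer, and header sizes below are assumptions for illustration, not UBIFS's actual on-flash layout:

```python
import math

# With fan-out f, indexing n leaf nodes needs roughly log_f(n) levels,
# so each lookup may cost that many reads; each index node holds f keys
# and f pointers (8 bytes each, assumed) plus a small header (assumed).
def tree_levels(n_nodes, fanout):
    return math.ceil(math.log(n_nodes, fanout))

def node_bytes(fanout, key_bytes=8, ptr_bytes=8, header_bytes=24):
    return fanout * (key_bytes + ptr_bytes) + header_bytes

assert node_bytes(8) == 152            # small nodes, cheap index updates
assert tree_levels(10**6, 8) == 7      # but ~7 levels for a million nodes
assert tree_levels(10**6, 64) == 4     # a wider fan-out flattens the tree
```

A larger fan-out thus trades bigger (more expensive) index writes for fewer reads per lookup, which is exactly why the choice is left as a filesystem-creation parameter.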

New nodes in the filesystem do not get included in the indexing B-tree immediately. Rather, their addresses are written to a journal, to which a few LEBs are dedicated. When the filesystem is mounted, this journal is scanned, the nodes are found, and based on the type and other information in the node header, they are merged into the indexing tree. This merging also happens periodically while the filesystem is active, so that the journal can be truncated. Those nodes that are not yet indexed are sometimes referred to as "buds" — a term which at first can be somewhat confusing. Fortunately the UBIFS code is sprinkled with some very good documentation so it wasn't too hard to discover that "buds" were nodes that would soon be "leaves" of the B-tree, but weren't yet — quite an apt botanical joke.

Much like f2fs, UBIFS keeps several erase blocks open for writes at the same time so that different sorts of data can be kept separate from each other, which, among other things, can improve cleaning performance. These open blocks are referred to as different "journal heads". UBIFS has one "garbage collection" head where the cleaner writes nodes that it moves — somewhat like the "COLD" sections in f2fs. There is also a "base" head where inodes, directory entries, and other non-data nodes are written — a bit like the "NODE" sections in f2fs. Finally, there are one or more "data" heads where file data is written, though the current code doesn't appear to actually allow the "or more" aspect of the design.

The other tree that UBIFS maintains is used for keeping track of free space or, more precisely, how many active nodes there are in each erase block. This tree is a radix tree with a fan-out of four. So if you write the address of a particular LEB in base four (also known as radix-four), then each digit would correspond to one level in the tree, and its value indicates which child to follow to get down to the next level.

[Radix tree diagram]
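A lookup in such a tree can be sketched as follows. The tree here is simply nested Python lists, with the leaves holding per-LEB active-node counts; the real UBIFS structure is of course more involved:

```python
# Each base-4 digit of the LEB number, most significant first, selects
# one of four children at the corresponding level of the tree.
def radix4_digits(leb, levels):
    return [(leb >> (2 * i)) & 3 for i in reversed(range(levels))]

def lookup(tree, leb, levels):
    node = tree
    for digit in radix4_digits(leb, levels):
        node = node[digit]   # follow the child chosen by this digit
    return node              # leaf: active-node count for this LEB

# A two-level tree covering 16 LEBs; leaves are usage counts.
tree = [[0, 3, 1, 0], [5, 0, 0, 2], [0, 0, 0, 0], [1, 1, 1, 1]]
assert radix4_digits(6, 2) == [1, 2]   # 6 is "12" in base four
assert lookup(tree, 7, 2) == 2
```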

This tree is stored in a completely separate part of the device with its own set of logical erase blocks, its own garbage collection, and consequently its own table of LEB usage counters. This last table must be small enough to fit in a single erase block and so imposes a (comfortably large) limit on the filesystem size. Keeping this tree separate seems like an odd decision, but it doubtless simplifies the task of keeping track of device usage: if the node that records the usage of an LEB were stored in that LEB itself, there would be additional complexity which this approach avoids.

A transition to FTL?

While JFFS2 clearly has limits, UBIFS seems to be much less limited. With 32 bits to address erase blocks which, themselves, could comfortably cover several megabytes, the addressing can scale to petabyte devices. The B-tree indexing scheme should allow large directories and large files to work just as well as small ones. The two-terabyte limit on individual files might one day become a problem, but that day still seems a long way off. With the recent addition of fastmap to UBI, UBIFS would seem ready to scale to the biggest flash storage we have available. But it still requires raw flash access, while a lot of flash devices force all access to pass through a flash translation layer. Could UBIFS still be useful on those devices?

Given that the UBI layer looks a lot like an FTL, it seems reasonable to wonder whether UBI could be modified slightly to talk to a regular block device instead, allowing UBIFS to be used on an SD card or similar. Could this provide useful performance?

Unfortunately such a conversion would be a little bit more than an afternoon's project. It would require:

  • Changing the expectation that all I/O is synchronous. This might be as simple as waiting immediately after submitting each request, but it would be better if true multi-threading could be achieved. Currently, UBIFS disables readahead because it is incompatible with a synchronous I/O interface.

  • Changing the expectation that byte-aligned reads are possible. UBIFS currently reads from a byte-aligned offset into a buffer, then decompresses from there. To work with the block layer it would be better to use a larger buffer that was sector-aligned, and then understand that the node read in would be found at an offset into that buffer, not at the beginning.

  • Changing the expectation that erased blocks read as all ones. When mounting a filesystem, UBIFS scans various erase blocks and assumes anything that isn't 0xFF is valid data. An FTL-based flash store will not provide that guarantee, so UBIFS would need to use a different mechanism to reliably detect dead data. This is not conceptually difficult but could be quite intrusive to the code.

  • Finding some way to achieve the same effect as the atomic LEB updates that UBI can provide. Again, a well understood problem, but possibly intrusive to fix.
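The second item in the list above — replacing byte-aligned reads with sector-aligned ones — can be sketched like this. The 512-byte sector size and the helper names are assumptions for illustration:

```python
SECTOR = 512

# Round the requested byte range out to sector boundaries, read whole
# sectors, then locate the node at an offset into the buffer rather
# than at its start — the adjustment UBIFS would need for the block layer.
def aligned_read(read_sectors, byte_off, length):
    start = byte_off - (byte_off % SECTOR)              # round down
    end = -(-(byte_off + length) // SECTOR) * SECTOR    # round up
    buf = read_sectors(start, end - start)
    skip = byte_off - start
    return buf[skip:skip + length]

device = bytes(range(256)) * 8                          # 2KB of fake data
def read_sectors(off, n):
    assert off % SECTOR == 0 and n % SECTOR == 0        # block-layer rule
    return device[off:off + n]

assert aligned_read(read_sectors, 700, 100) == device[700:800]
```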

So, without a weekend to spare, that approach cannot be experimented with. Fortunately there is an alternative. As mentioned, there already exists a "block2mtd" driver which can be used to connect UBIFS, via UBI and MTD, to a block device. This driver is deliberately very simple and consequently quite inefficient. For example, it handles the mtd_erase() function by writing blocks full of 0xFF to the device. However, it turns out that it is only an afternoon's project to modify it to allow for credible testing.

This patch modifies the block2mtd driver to handle mtd_erase() by recording the location of erased blocks in memory, returning 0xFF for any read of an erased block, and not writing out the PEB headers until real data is to be written to the PEB. The result of these changes is that the pattern of reads and, more importantly, writes to the block device will be much the same as the pattern expected from a more properly modified UBIFS. It is clearly not useful for real usage, as important information is kept only in memory, but it can provide a credible base for performance testing.
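The behavior of the patched driver can be modeled as follows — in Python rather than the real C driver, with all data structures invented for illustration:

```python
# Model of the patched block2mtd: erases are recorded in memory rather
# than written out, and reads of an erased block synthesize 0xFF bytes
# instead of touching the device.
class PatchedBlock2MTD:
    def __init__(self, n_blocks, block_size):
        self.block_size = block_size
        self.data = {}                       # block -> bytes actually written
        self.erased = set(range(n_blocks))   # everything starts erased

    def erase(self, block):
        self.erased.add(block)               # just a note in memory, no I/O
        self.data.pop(block, None)

    def write(self, block, payload):
        self.erased.discard(block)           # first real write uses the block
        self.data[block] = bytes(payload)

    def read(self, block):
        if block in self.erased:             # no device read at all
            return b"\xff" * self.block_size
        buf = self.data[block]
        return buf + b"\xff" * (self.block_size - len(buf))

dev = PatchedBlock2MTD(n_blocks=4, block_size=8)
assert dev.read(0) == b"\xff" * 8            # erased blocks read as all ones
dev.write(0, b"\x01\x02")
assert dev.read(0)[:2] == b"\x01\x02"
dev.erase(0)
assert dev.read(0) == b"\xff" * 8
```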

The obvious choice of what to test it against is f2fs. Having examined the internals of both f2fs and UBIFS, we have found substantial similarity, which is hardly surprising as they have both been designed to work with flash storage. Both write whole erase blocks at a time where possible, both have several erase blocks "open" at once, and both make some effort to collect similar data into the same erase blocks. There are, of course, differences: UBIFS probably scales better to large directories, it can compress data being written, and it does not currently support exporting via NFS, partly because of the difficulty of providing a stable index for directory entries.

The compression support is probably most interesting. If the CPU is fast enough, compression might be faster than writing to flash and this could give UBIFS an edge in speed.

I performed some testing with f2fs and UBIFS; the latter was tested twice, with and without the use of compression (the non-compression case is marked below as "NC"). Just for interest's sake I've added NILFS2, ext4 and Btrfs. None of these is particularly designed for FTL-based flash, though NILFS2 can align writes with the erase blocks and so might perform well. The results of the last two should be treated very cautiously: no effort was made to tune them to the device used, and all the results are based on writing to an empty device. f2fs, UBIFS, and NILFS2 are known to "clean" the device so that they always write to unused erase blocks; ext4 and Btrfs do no such cleaning, so it is quite possible that their performance will degrade on a more "aged" filesystem. So the real long-term values for these filesystems might be better, and might be worse, than what we see here.

For testing I used a new class 10 16GB microSD card, which claims 10MB/s throughput and seems to provide close to that for sequential I/O. According to the flashbench tool, the card appears to have an 8MB erase block size; five erase blocks can be open at a time, and only the first erase block is optimized for a PC-style file allocation table. The kernel used was 3.6.6 for openSUSE with the above-mentioned patch and the v3 release of f2fs.

The tests performed were very simple. To measure small file performance, a tar archive of the Linux kernel (v3.7-rc6) was unpacked ten times and then — after unmounting and remounting — the files were read back in again and "du" and "rm -r" were timed to check metadata performance. The "rm -r" test was performed with a warm cache, immediately after the "du -a", which was performed on a cold cache. The average times in seconds for these operations were:

                  ubifs   ubifs — NC    f2fs   NILFS2    ext4   Btrfs
    Write kernel   72.4        139.9   118.4    140.0   135.5    93.6
    Read kernel    72.5        129.6   175.7     95.6   108.8   121.0
    du -s           9.9          8.7    48.6      4.4     4.4    13.8
    rm -r           0.48         0.45    0.36    11.0     4.9    33.6

Some observations:

  • UBIFS, with compression, is clearly the winner at reading and writing small files. This test was run on an Intel Core i7 processor running at 1GHz; on a slower processor, the effect might not be as big. Without compression, UBIFS is nearly the slowest, which is a little surprising, but that could be due to the multiple levels that data passes through (UBI, MTD, block2mtd).

  • f2fs is surprisingly poor at simple metadata access (du -s). It is unlikely that this is due to the format chosen for the filesystem — the indirection of the Node Address Table is the only aspect of the design that could possibly cause this slowdown and it could explain at most a factor of two. This poor performance is probably some simple implementation issue. The number is stable across the ten runs, so it isn't just a fluke.

  • Btrfs is surprisingly fast at writing. The kernel source tree is about 500MB in size, so this is around 5.5MB/sec, which is well below what the device can handle but is still faster than anything else. This presumably reflects the performance-tuning efforts that the Btrfs team have made.

  • "rm -r" is surprisingly slow for the non-flash-focused filesystems, particularly Btrfs. The variance is high too. For ext4, the slowest "rm -r" took 32.4 seconds, while, for Btrfs, the slowest was 137.8 seconds — over 2 minutes. This seems to be one area where tuning the design for flash can be a big win.

So there is little here to really encourage spending that weekend to make UBIFS work well directly on flash. Except for the compression advantage, we are unlikely to do much better than f2fs, which can be used without that weekend of work. We would at least need to see how compression performs on the processor found in the target device before focusing too much on it.

As well as small files, I did some even simpler large-file tests. For this, I wrote and subsequently read two large, already compressed, files. One was an mp4 file with about one hour of video. The other was an openSUSE 12.2 install ISO image. Together they total about 6GB. The total times for each filesystem were:

                  ubifs   ubifs — NC    f2fs   NILFS2    ext4   Btrfs
    write files     850          876     838     1522     696     863
    read files     1684         1539     571      574     571     613

The conclusions here are a bit different:

  • Now ext4 is a clear winner on writes; it would be very interesting to work out why. The time translates to about 8.8MB/sec, which is getting close to the theoretical maximum of 10MB/sec.

  • Conversely, NILFS2 is a clear loser, taking nearly twice as long as the other filesystems. Two separate runs showed similar results so it looks like there is room for some performance tuning here.

  • UBIFS is a clear loser on reads. This is probably because nodes are not aligned to sectors so some extra reading and extra copying is needed.

  • UBIFS's ability to compress data clearly doesn't help much with these large files. UBIFS did a little better with compression enabled, suggesting that the files were partly compressible, but not enough to come close to f2fs.

In summary, while f2fs appears to have room for improvement in some aspects of the implementation, there seems little benefit to be gained from pushing UBIFS into the arena of FTL-based devices. It will likely remain the best filesystem for raw flash, while f2fs certainly has some chance of positioning itself as the best filesystem for FTL-based flash. However, we certainly shouldn't write off ext4 or Btrfs. As noted earlier, these tests are not expected to give a firm picture of these two filesystems so we cannot read anything conclusive from them. However, it appears that both have something to offer, if only we can find a way to isolate that something.

Comments (21 posted)


Page editor: Jonathan Corbet

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds