Kernel development

Brief items

Kernel release status

The current development kernel is 3.7-rc4, released on November 4. Linus Torvalds said: "Perhaps notable just because of the noise it caused in certain circles, there's the ext4 bitmap journaling fix for the issue that caused such a ruckus. It's a tiny patch and despite all the noise about it you couldn't actually trigger the problem unless you were doing crazy things with special mount options."

Stable status: 3.0.51, 3.4.18, and 3.6.6 were released on November 5. They all include important fixes. The latter two (3.4.18, 3.6.6) contain the fix for the ext4 corruption problem that initially looked much worse than it turned out to be.

Comments (none posted)

Quotes of the week

Who does not love the smell of formal methods first thing in the morning?

— Mel Gorman

TCP metrics code says "Yo dawg, I heard you like sizeof, so I did a sizeof of a sizeof, so you can size your size" Fix from Julian Anastasov.

— David Miller (Thanks to Benjamin Poirier)

Comments (none posted)

Kernel development news

Many more words on volatile ranges

By Michael Kerrisk
November 5, 2012

The volatile ranges feature provides applications that cache large amounts of data that they can (relatively easily) re-create—for example, browsers caching web content—with a mechanism to assist the kernel in making decisions about which pages can be discarded from memory when the system is under memory pressure. An application that wants to assist the kernel in this way does so by informing the kernel that a range of pages in its address space can be discarded at any time, if the kernel needs to reclaim memory. If the application later decides that it would like to use those pages, it can request the kernel to mark the pages nonvolatile. The kernel will honor that request if the pages have not yet been discarded. However, if the pages have already been discarded, the request will return an error, and it is then the application's responsibility to re-create those pages with suitable data.

Volatile ranges, take 12

John Stultz first proposed patches to implement volatile ranges in November 2011. As we wrote then, the proposed user-space API was via POSIX_FADV_VOLATILE and POSIX_FADV_NONVOLATILE operations for the posix_fadvise() system call. Since then, it appears that he has submitted at least another eleven iterations of his patch, incorporating feedback and new ideas into each iteration. Along the way, some feedback from David Chinner caused John to revise the API, so that a later patch series instead used the fallocate() system call, with two new flags, FALLOCATE_FL_MARK_VOLATILE and FALLOCATE_FL_UNMARK_VOLATILE.

The volatile ranges patches were also the subject of a discussion at the 2012 Linux Kernel Summit memcg/mm minisummit. What became clear there was that few of the memory management developers are familiar with John's patch set, and he appealed for more review of his work, since there were some implementation decisions that he didn't feel sufficiently confident to make on his own. As ever, getting sufficient review of patches is a challenge, and the various iterations of John's patches are a good case in point: several iterations of his patches received no or little substantive feedback.

Following the memcg/mm minisummit, John submitted a new round of patches, in an attempt to move this work further forward. His latest patch set begins with a lengthy discussion of the implementation and outlines a number of open questions.

The general design of the API is largely unchanged, with one notable exception. During the memcg/mm minisummit, John noted that repeatedly marking pages volatile and nonvolatile could be expensive, and was interested in ideas about how the kernel could do this more efficiently. Instead, Taras Glek (a Firefox developer) and others suggested an idea that could side-step the question of how to more efficiently implement the kernel operations: if a process attempts to access a volatile page that has been discarded from memory, then the kernel could generate a SIGBUS signal for the process. This would allow a process that wants to briefly access a volatile page to avoid the expense of bracketing the access with calls to unmark the page as volatile and then mark it as volatile once more.

Instead, the process would access the data, and if it received a SIGBUS signal, it would know that the data at the corresponding address needs to be re-created. The SIGBUS signal handler can obtain the address of the memory access that generated the signal via one of its arguments. Given that information, the signal handler can notify the application that the corresponding address range must be unmarked as volatile and repopulated with data. Of course, an application that doesn't want to deal with signals can still use the more expensive unmark/access/mark approach.

There are still a number of open questions regarding the API. As noted above, following Dave Chinner's feedback, John revised the interface to use the fallocate() system call instead of posix_fadvise(), only to have it suggested by other memory management maintainers at the memcg/mm minisummit that posix_fadvise() or madvise() would be better. The latest implementation still uses fallocate(), though John thinks his original approach of using posix_fadvise() is slightly more sensible. In any case, he is still seeking further input about the preferred interface.

The volatile ranges patches currently only support mappings on tmpfs filesystems, and marking or unmarking a range volatile requires the use of the file descriptor corresponding to the mapping. In his mail, John explained that Taras finds the file-descriptor-based interface rather cumbersome:

In talking with Taras Glek, he pointed out that for his needs, the fd based interface is a little annoying, as it requires having to get access to tmpfs file and mmap it in, then instead of just referencing a pointer to the data he wants to mark volatile, he has to calculate the offset from start of the mmap and pass those file offsets to the interface.

John acknowledged that an madvise() interface would be nice, but it raises some complexities. The problem with an madvise() interface is that it could be more generally applied to any part of the process's address space. However, John wondered what semantics could be attached to volatile ranges that are applied to anonymous mappings (e.g., when pages are duplicated via copy-on-write, should the duplicated page also be marked volatile?) or file-mappings on non-tmpfs filesystems. Therefore, the latest patch series provides only the file-descriptor-based interface.

There are a number of other subtle implementation details that John has considered in the volatile ranges implementation. For example, if a large page range is marked volatile, should the kernel perform a partial discard of pages in the range when under memory pressure, or discard the entire range? In John's estimation, discarding a subset of the range probably destroys the "value" of the entire range. So the approach taken is to discard volatile ranges in their entirety.

Then there is the question of how to treat volatile ranges that overlap and volatile ranges that are contiguous. Overlapping ranges are coalesced into a single range (which means they will be discarded as a unit). Contiguous ranges are slightly different. The current behavior is to merge them if neither range has yet been discarded. John notes that coalescing in these circumstances may not be desirable: since the application marked the ranges volatile in separate operations, it may not necessarily wish to see both ranges discarded together.

But at this point a seeming oddity of the current implementation intervenes: the volatile ranges implementation deals with address ranges at a byte level of granularity rather than at the page level. It is possible to mark (say) a page and half as volatile. The kernel will only discard complete volatile pages, but, if a set of contiguous sub-page ranges covering an entire page is marked volatile, then coalescing the contiguous ranges allows the page to be discarded if necessary. In response to this and various other points in John's lengthy mail, Neil Brown wondered if John was:

trying to please everyone and risked pleasing no-one… For example, allowing sub-page volatile region seems to be above and beyond the call of duty. You cannot mmap sub-pages, so why should they be volatile?

John responded that it seemed sensible from a user-space point of view to allow sub-page marking and it was not too complex to implement. However, the use case for byte-granularity volatile ranges is not obvious from the discussion. Given that the goal of volatile ranges is to assist the kernel in freeing up what would presumably be a significant amount of memory when the system is under memory pressure, it seems unlikely that a process would make multiple system calls to mark many small regions of memory volatile.

Neil also questioned the use of signals as a mechanism for informing user space that a volatile range has been discarded. The problem with signals, of course, is that their asynchronous nature means that they can be difficult to deal with in user-space applications. Applications that handle signals incorrectly can be prone to subtle race errors, and signals do not mesh well with some other parts of the user-space API, such as POSIX threads. John replied:

Initially I didn't like the idea, but have warmed considerably to it. Mainly due to the concern that the constant unmark/access/mark pattern would be too much overhead, and having a lazy method will be much nicer for performance. But yes, at the cost of additional complexity of handling the signal, marking the faulted address range as non-volatile, restoring the data and continuing.

There are a number of other unresolved implementation decisions concerning the order in which volatile range pages should be discarded when the system is under memory pressure, and John is looking for input on those decisions.

A good heuristic is required for choosing which ranges to discard first. The complicating factor here is that a volatile page range may contain both frequently and rarely accessed data. Thus, using the least recently used page in a range as a metric in the decision about whether to discard a range could cause quite recently used pages to be discarded. The Android ashmem implementation (upon which John's volatile ranges work is based) employed an approach to this problem that works well for Android: volatile ranges are discarded in the order in which they are marked volatile, and, since applications are not supposed to touch volatile pages, the least-recently-marked-volatile order provides a reasonable approximation of least-recently-used order.

But the SIGBUS semantics described above mean that an application could continue to access a memory region after marking it as volatile. Thus, the Android approach is not valid for John's volatile range implementation. In theory, the best solution might be to evaluate the age of the most recently used page in each range and then discard the range with the oldest most recently used page; John suspects, however, that there may be no efficient way of performing that calculation.

Then there is the question of the relative order of discarding for volatile and nonvolatile pages. Initially, John had thought that volatile ranges should be discarded in preference to any other pages on the system, since applications have made a clear statement that they can recover if the pages are lost. However, at the memcg/mm minisummit, it was pointed out that there may be pages on the system that are even better candidates for discarding, such as pages containing streamed data that is unlikely to be used again soon (if at all). However, the question of how to derive good heuristics for deciding the best relative order of volatile pages versus various kinds of nonvolatile pages remains unresolved.

One other issue concerns NUMA-systems. John's latest patch set uses a shrinker-based approach to discarding pages, which allows for an efficient implementation. However, (as was discussed at the memcg/mm minisummit) shrinkers are not currently NUMA-aware. As a result, when one node on a multi-node system is under memory pressure, volatile ranges on another node might be discarded, which would throw data away without relieving memory pressure on the node where that pressure is felt. This issue remains unresolved, although some ideas have been put forward about possible solutions.

Volatile anonymous ranges

In the thread discussing John's patch set, Minchan Kim raised a somewhat different use case that has some similar requirements. Whereas John's volatile ranges feature operates only on tmpfs mappings and requires the use of a file descriptor-based API, Minchan expressed a preference for an madvise() interface that could operate on anonymous mappings. And whereas John's patch set employs its own address-range based data structure for recording volatile ranges, Minchan proposed that volatility could be an implemented as a new VMA attribute, VM_VOLATILE, and madvise() would be used to set that attribute. Minchan thinks his proposal could be useful for user-space memory allocators.

With respect to John's concerns about copy-on-write semantics for volatile ranges in anonymous pages, Minchan suggested volatile pages could be discarded so long as all VMAs that share the page have the VM_VOLATILE attribute. Later in the thread, he said he would soon try to implement a prototype for his idea.

Minchan proved true to his word, and released a first version of his prototype, quickly followed by a second version, where he explained that his RFC patch complements John's work by introducing:

new madvise behavior MADV_VOLATILE and MADV_NOVOLATILE for anonymous pages. It's different with John Stultz's version which considers only tmpfs while this patch considers only anonymous pages so this cannot cover John's one. If below idea is proved as reasonable, I hope we can unify both concepts by madvise/fadvise.

Minchan detailed his earlier point about user-space memory allocators by saying that many allocators call munmap() when freeing memory that was allocated with mmap(). The problem is that munmap() is expensive. A series of page table entries must be cleaned up, and the VMA must be unlinked. By contrast, madvise(MAD_VOLATILE) only needs to set a flag in the VMA.

However, Andrew Morton raised some questions about Minchan's use case:

Presumably the userspace allocator will internally manage memory in large chunks, so the munmap() call frequency will be much lower than the free() call frequency. So the performance gains from this change might be very small.

The whole point of the patch is to improve performance, but we have no evidence that it was successful in doing that! I do think we'll need good quantitative testing results before proceeding with such a patch, please.

Paul Turner also expressed doubts about Minchan's rationale, noting that the tcmalloc() user-space memory allocator uses the madvise(MADV_DONTNEED) operation when discarding large blocks from free(). That operation informs the kernel that the pages can be (destructively) discarded from memory; if the process tries to access the pages again, they will either be faulted in from the underlying file, for a file mapping, or re-created as zero-filled pages, for the anonymous mappings that are employed by user-space allocators. Of course, re-creating the pages zero filled is normally exactly the desired behavior for a user-space memory allocator. In addition, MADV_DONTNEED is cheaper than munmap() and has the further benefit that no system call is required to reallocate the memory. (The only potential downside is that process address space is not freed, but this tends not to matter on 64-bit systems.)

Responding to Paul's point, Motohiro Kosaki pointed out that the use of MADV_DONTNEED in this scenario is sometimes the source of significant performance problems for the glibc malloc() implementation. However, he was unsure whether or not Minchan's patch would improve things.

Minchan acknowledged Andrew's questioning of the performance benefits, noting that his patch was sent out as a request for comment; he agreed with the need for performance testing to justify the feature. Elsewhere in the thread, he pointed to some performance measurements that accompanied a similar patch proposed some years ago by Rik van Riel; looking at those numbers, Minchan believes that his patch may provide a valuable optimization. At this stage, he is simply looking for some feedback about whether his idea warrants some further investigation. If his MADV_VOLATILE proposal can be shown to yield benefits, he hopes that his approach can be unified with John's work.

Conclusion

Although various people have expressed an interest in the volatile ranges feature, its progress towards the mainline kernel has been slow. That certainly hasn't been for want of effort by John, who has been steadily refining his well-documented patches and sending them out for review frequently. How that progress will be affected by Minchan's work remains to be seen. On the positive side, Minchan—assuming that his own work yields benefits—would like to see the two approaches integrated. However, that effort in itself might slow the progress of volatile ranges toward the mainline.

Given the user-space interest in volatile ranges, one supposes that the feature will eventually make its way into the kernel. But clearly, John's work, and eventually also Minchan's complementary work, could do with more review and input from the memory management developers to reach that goal.

Comments (28 posted)

UEFI secure boot kernel restrictions

By Jake Edge
November 7, 2012

The UEFI secure boot "problem" spans multiple levels in a Linux system. There are the actual "getting Linux to boot" issues, which have mostly been addressed by the two signed bootloaders that are available for distributions and users. Beyond that, though, are a set of kernel features that have the potential to subvert secure boot. Depending on one's perspective, those features either need to be configurable in the kernel—so some distributions can turn them off in their signed kernels—or they pose little risk beyond that of existing (but unknown) kernel bugs. As might be guessed, both sides of that argument can be heard in a recent linux-kernel thread.

The root problem, so to speak, is that the root user on Linux systems is trusted by the kernel. That means root can make all sorts of temporary—or permanent—changes to the state of the system. Those changes include things like using kexec() to boot a different operating system, or writing a hibernation image to the swap partition for use in a resume. But both of those things could be used by an attacker to circumvent the protections that secure boot is meant to enforce.

If, for example, a user were to boot using one of the Microsoft-signed Linux bootloaders—"shim" or "pre-bootloader"—into a kernel that didn't restrict root's powers, that kernel could arrange to execute a different kernel, perhaps one that has been compromised. Worse yet, from the perspective of those worried about Microsoft blacklisting bootloaders or keys, that second kernel could actually be a malicious version of Windows. So, a secure-boot-protected system would end up booting mal-Windows, which is precisely the scenario secure boot is supposed to prevent.

If that occurs in the wild, various folks believe that Microsoft will blacklist the bootloader that was used in the attack. If that's the same bootloader used to boot Linux, new systems, as well as old systems that get the blacklist update, will no longer boot Linux. Matthew Garrett (at least) is concerned about that scenario, so he has proposed kernel changes that would prevent suitably configured kernels from using kexec() among a handful of other restrictions that could be used to circumvent secure boot.

That was back in early September, and those changes were relatively uncontroversial except for the capability name Garrett chose (CAP_SECURE_FIRMWARE) and the kexec() restriction. In mid-September, he followed up with a set of patches that were substantially the same, though the kexec() patch was removed and the capability was renamed to CAP_COMPROMISE_KERNEL. In the patch, he noted that "if anyone wants to deploy these then they should disable kexec until support for signed kexec payloads has been merged".

Things went quiet for more than a month, but have since erupted into a rather large thread. A query from Jiri Kosina about loading firmware into a secure boot kernel led to a discussion of the threat model that is being covered by Garrett's patch set. While Garrett agreed that firmware loading should eventually be dealt with via signatures, it is not as high on his priority list. An automated attack using crafted firmware would be very hardware-specific, requiring reverse engineering, and "we'd probably benefit from them doing that in the long run".

Garrett's focus on automated attacks makes it clear what threat models he is trying to thwart, so Kosina's next query, about resuming from hibernation, is an issue that Garrett believes should be addressed. It turns out that Josh Boyer has a patch to disable hibernation for secure boot systems, but that, like disabling kexec(), was not overwhelmingly popular.

There are other ways to handle resuming from hibernation, for example by creating keys at boot time that get stored in UEFI boot variables and that the kernel uses to sign hibernation images. But it is clear that some kernel developers are starting (or continuing) to wonder if the kernel secure boot support isn't going a bit—or far more than a bit—overboard.

For one thing, as James Bottomley pointed out, there will always be kernel bugs that allow circumvention of these restrictions (e.g. by root reading the hibernation signing key or flipping the capability bit):

[...] a local privilege elevation attack usually exploits some bug (classically a buffer overflow) which executes arbitrary code in kernel context. In that case, the same attack vector can be used to compromise any in-kernel protection mechanism including turning off the secure boot capability and reading the in-kernel private signing key.

[...] The point I'm making is that given that the majority of exploits will already be able to execute arbitrary code in-kernel, there's not much point trying to consider features like this as attacker prevention. We should really be focusing on discussing why we'd want to prevent a legitimate local root from writing to the suspend partition in a secure boot environment.

But kernel exploits appear to be "off the table", at least in terms of the secure boot circumvention that Garrett and others are concerned about. Kosina said:

My understanding is that we are not trying to protect against root exploiting the kernel. We are trying to protect against root tampering with the kernel code and data through legitimate use of kernel-provided [facilities] (/dev/mem, ioperm, reprogramming devices to DMA to arbitrary memory locations, resuming from hibernation image that has been tampered with, etc).

It's not exactly clear why Microsoft would make a distinction between a kernel exploit and using legitimate kernel services when making a blacklisting decision, though. But, for distributions that do ship signed kernels, they can reduce the attack surface substantially: to only those kernels that they have signed, with whatever vulnerabilities are present in those particular versions.

Eric Paris detailed one possible attack that installs a crafted Linux boot environment (with a legitimately signed bootloader and kernel), which sleeps after setting up a trojaned Windows hibernation image. Users would need to wake the machine twice, but would end up running malware in a secure boot system.

Bottomley and others are, at the very least, uncomfortable with the idea of an "untrusted root". At the heart of the kernel changes for secure boot is removing the ability for root to make persistent changes to the boot environment. The patches that Garrett has proposed close many of the known holes that would allow root to make those kinds of changes, but the argument is that there are likely to be others. As Alan Cox put it:

With all the current posted RH patches I can still take over the box as root trivially enough and you seem to have so far abolished suspend to disk, kexec and a pile of other useful stuff. To actually lock it down you'll have to do a ton more of this.

Another possible way to handle Linux being used as an attack vector against Windows (which is how keys are likely to get blacklisted) is to change the behavior of the Linux bootloaders. Bottomley suggested that a "present user" test on the first boot of the bootloader, which could be detected because the UEFI key database and the "machine owner key" database do not contain the proper keys, would alleviate the problem. Garrett pointed out that the shim bootloader does not do this because it needs to be able to boot unattended, even on first boot. But, Bottomley saw that as unfortunate:

[...] what I'm telling you is that by deciding to allow automatic first boot, you're causing the windows attack vector problem. You could easily do a present user test only on first boot which would eliminate it. Instead, we get all of this.

Garrett, though, sees unattended first boot as an absolute requirement, especially for those who are trying to do automated installations for Linux systems. Others disagreed, not surprisingly, and the discussion still continues. It should be noted that the pre-bootloader that Bottomley released does do a present user test on first boot (and beyond, depending on whether the user changes the secure boot configuration).

There does seem to be something of whack-a-mole problem here in terms of finding all of the ways that this "untrusted root" might be able to impact secure boot. In addition, new kernel features will have to also be scrutinized to see whether they need to be disabled depending on CAP_COMPROMISE_KERNEL.

Not trusting root is a very different model than kernel developers (and users) are accustomed to. One can imagine that all of the different problem areas will be tracked down eventually, but it will be a fair amount of work. Whether that work is truly justified in support of a feature that is largely (though not completely) only useful for protecting Windows is a big question. On the other hand, not being able to [easily] boot Linux on x86 hardware because of key blacklisting would be problematic too. This one will take some time to play out.

Comments (44 posted)

A NILFS2 score card

November 7, 2012

This article was contributed by Neil Brown

A recurring theme in the comments on various articles announcing the new f2fs "flash-friendly" filesystem was that surely some other filesystem might already meet the need, or could be adjusted to meet the need, rather than creating yet another new filesystem. This is certainly an interesting question, but not one that is easy to answer. The cost/benefit calculation for creating a new filesystem versus enhancing an existing one involves weighing many factors, including motivational, social, political, and even, to some extent, technical. It is always more fun building something from scratch; trying to enter and then influence an existing community is at best unpredictable.

Of the various factors, the only ones for which there is substantial visibility to the outside observer are the technical factors, so while they may not be the major factors, they are the ones that this article will explore. In particular, we will examine "NILFS2", a filesystem which has been part of Linux for several years and is — superficially at least — one the best contenders as a suitable filesystem for modest-sized flash storage.

NILFS2 was not written primarily to be a flash-based filesystem, so comparing it head-to-head on that basis with f2fs (which was) might not be entirely fair. Instead we will examine it on its own merits, comparing it with f2fs occasionally to provide useful context, and ask "could this have made a suitable base for a flash-focused filesystem?"

`NILFS2`: what is it?

NILFS2 is the second iteration of the "New Implementation of a Log-structured File System". It is described as a "Continuous Snapshotting" filesystem, a feature which will be explored in more detail shortly.

NILFS2 appears to still be under development, with lots of core functionality present, but a number of important features still missing, such as extended attributes, quotas, and an fsck tool. As such, it is in a similar state to f2fs: well worth experimenting with, but not really ready for production usage yet.

In contrast with f2fs, NILFS2 uses 64 bits for all block addressing, 96 bits for time stamps (nanoseconds forever!), but only 16 bits for link counts (would you ever have more than 65535 links to a file, or sub-directories in one directory?). F2fs, in its initial release, uses 32 bits for each of these values.

While f2fs is a hybrid LFS (Log-structured Filesystem), using update-in-place in a number of cases, NILFS2 is a pure LFS. With the exception of the superblock (stored twice, once at either end of the device), everything is written in one continuous log. Data blocks are added to the log, then indexing information, then inodes, then indexes for the inodes, and so on. Occasionally a "super root" inode is written, from which all other blocks in the filesystem can be found. The address of the latest "super root" is stored in the superblock, along with static values for various parameters of the filesystem and a couple of other volatile values such as the number of free blocks.

Whenever a collection of blocks is written to the log, it is preceded by a segment summary which identifies all the blocks in the segment (similar to the Segment Summaries of f2fs which are stored in a separate area). Consecutive segment summaries are linked together so that, in the event of a crash, all the segment summaries since the most recent super root can be read and the state of the filesystem can be reconstructed.

The segment size can be chosen to be any number of blocks, which themselves must have a size that is a power of two up to the page size of the host system. The default block size is 4KB and the default device segment size is 8MB. Segments can easily be made to line up with erase blocks in a flash device, providing their size is known. While NILFS2 tries to write whole segments at a time, it is not always possible, so a number of consecutive partial segments might be written, each with their own segment summary block.

Being a pure LFS, NILFS2 will never write into the middle of an active segment — as f2fs does when space is tight. It insists on "cleaning" partially used segments (copying live data to a new segment) to make more space available, and does not even keep track of which particular blocks in a segment might be free. If there are no clean segments beyond those reserved for cleaning, the filesystem is considered to be full.

Everything is a file!

The statement "Everything is a file" is part of the Unix tradition and, like many such statements, it sounds good without tying you down to meaning very much. Each of "file", "everything" and even "is a" is open to some interpretation. If we understand "file" to be a collection of data and index blocks that provide some linearly addressed storage, "everything" to mean most data and metadata — excepting only the superblock and the segment summaries — and "is a" to mean "is stored inside a", then NILFS2 honors this Unix tradition.

For example, in a more traditional filesystem such as ext3, inodes (of which there is one per file) are stored at fixed locations in the device — usually a number of locations distributed across the address space, but fixed nonetheless. In f2fs, a hybrid approach is used where the addresses of the inodes are stored in fixed locations (in the Node Address Table — NAT), while the inodes themselves appear in the log, wherever is convenient. For NILFS2, the inode table is simply another file, with its own inode which describes the locations of the blocks.

This file (referred to as the ifile) also contains a bitmap allowing unused inodes to be found quickly, and a "block group descriptor" which allows non-empty bitmaps to be found quickly. With the default block size, every 2²⁵ inodes has 1024 blocks for bitmaps, and one descriptor block, which lists how many bits are set in each of those bitmaps. If you want more inodes than that, a second descriptor block will be automatically allocated.

The inodes themselves are a modest 128 bytes in size, and here I must confess to an oversimplification in the article on f2fs. The statement that "Copy-on-write is rather awkward for objects that are smaller than the block size" holds a grain of truth, but isn't really true as it stands. The reality is more subtle.

The advantages of a small inode size are primarily in efficiency. Less space can be wasted, and fewer I/O requests are needed to load the same number of inodes. For a traditional filesystem with pre-allocated inode regions, the space wasted can be a significant issue. However, that does not really apply to an LFS which allocates the space on demand. The speed issue is slightly harder to reason about. Certainly if all the inodes for files in one directory live in one block, then the common task of running "ls -l" will be expedited. However if more information, such as extended attributes or file data for small files, is stored in a big inode, accessing that will require only one block to be read, not two.

The advantages of a block-sized inode — apart from the extra space, which is of uncertain value — is that inodes can be updated independently. OCFS2 (a cluster-based filesystem) uses this to simplify the locking overhead — a cluster node does not need to gain exclusive access to inodes in the same block as the one that it is interested in when it performs an update, because there aren't any. In an LFS, the main issue is reducing cleaning overhead. As we noted in the f2fs article, grouping data with similar life expectancy tends to reduce the expected cost of cleaning, so storing an inode together with the data that it refers to is a good idea. If there are several inodes in the one block, then the life expectancy will be the minimum for all the inodes, and so probably quite different from nearby data. This could impose some (hard to measure) extra cleaning cost.

On the whole, it would seem best for an LFS if the one-inode-per-block model were used, as there is minimal cost of wasted space and real opportunities for benefits. If ways are found to make maximal use of that extra space, possibly following some ideas that Arnd Bergmann recently suggested, then block-sized inodes would be even more convincing.

Small inodes might be seen as a reason not to choose NILFS2, though not a very strong reason. Adjusting NILFS2 to allow full-block inodes would not be a large technical problem, though it is unknown what sort of social problem it might be.

As with most filesystems, NILFS2 also stores each directory in a "file" so there are no surprises there. The surprise is that the format used is extremely simple. NILFS2 directories are nearly identical to ext2 directories, the only important difference being that they store a 64-bit inode number rather than just 32 bits. This means that any lookup requires a linear search through the directory. For directories up to about 400 entries (assuming fairly short names on average), this is no different from f2fs. For very large directories, the search time increases linearly for NILFS2, while it is logarithmic for f2fs. While f2fs is not particularly efficient at this task, NILFS2 clearly hasn't made any effort at effective support for large directories. There appears to be an intention to implement some sort of B-tree based directory structure in the future, but this has not yet happened.

Use the right tree for the job

If everything is a file, then it is clearly important to know what a file is. It starts at the inode which contains space for seven 64-bit addresses. When the file is small (seven blocks or less) these contain pointers to all of the allocated blocks. When the file is larger, this space changes to become the root of a B-tree, with three keys (file addresses), three pointers to other B-tree nodes, and a small header.

The interesting thing about this B-tree is that the leaves do not contain extents describing contiguous ranges of blocks; instead they describe each block individually. This is interesting because it does not fit the typical use-case for a B-tree.

The particular value of a B-tree is that it remains balanced independently of the ordering or spacing of keys being inserted or removed. There is a cost that blocks may occasionally need to be split or merged, but this is more than compensated for by the ability to add an unpredictable sequence of keys. When extents are being stored in a tree, it is not possible to predict how long each extent will be, or when an extent might get broken, so the sequence of keys added will be unpredictable and a B-tree is ideal.

When the keys being added to the index are the offsets of consecutive blocks, then the sequence is entirely predictable and a different tree is likely to be preferred. A radix tree (where the path through the tree is a simple function of the key value) is much more compact than a B-tree (as there is no need to store keys) and much simpler to code. This is the sort of tree chosen for f2fs, the tree used for ext3, and generally the preferred choice when block extents are not being used to condense the index.

The only case where a B-tree of individual blocks might be more efficient than a radix tree is where the file is very sparse, having just a few allocated blocks spread throughout a large area that is unallocated. Sparse files are simply not common enough among regular files to justify optimizing for them. Nor are many of the special files that NILFS2 uses likely to be sparse. The one exception is the checkpoint file (described later), and optimizing the indexing strategy for that one file is unlikely to have been a motivation.

So we might ask "why?". Why does NILFS2 use a B-tree, or why does it not use extents in its addressing? An early design document [PDF] suggests that B-trees were chosen due to their flexibility, and while it isn't yet clear that the flexibility is worth the cost, future developments might show otherwise. The lack of extent addressing can be explained with a much more concrete answer once we understand one more detail about file indexing.

Another layer of indirection

The headline feature for NILFS2 is "continuous snapshotting". This means that it takes a snapshot of the state of the filesystem "every few seconds". These are initially short-term snapshots (also called "checkpoints"), and can be converted to long-term snapshots, or purged, by a user-space process following a locally configurable policy. This means there are very likely to be lots of active snapshots at any one time.

As has been mentioned, the primary cost of an LFS is cleaning — the gathering of live data from nearly-empty segments to other segments so that more free space can be made available. When there is only one active filesystem, each block moved only requires one index to be updated. However, when there are tens or hundreds of snapshots, each block can be active in a fairly arbitrary sub-sequence of these, so relocating a block could turn into a lot of work in updating indices.

Following the maxim usually attributed to David Wheeler, this is solved by adding a level of indirection. One of the special files that NILFS2 uses is known as the "DAT" or "Disk Address Translation" file. It is primarily an array of 64-bit disk addresses, though the file also contains allocation bitmaps like the ifile, and each entry is actually 256 bits as there is also a record of which checkpoints the block is active in. The addresses in the leaves of the indexing trees for almost all files are not device addresses but are, instead, indexes into this array. The value found contains the actual device address. This allows a block to be relocated by simply moving the block and updating this file. All snapshots will immediately know where the new location is. Doing this with variable length extents would be impractical, which appears to be one of the main reasons that NILFS2 doesn't use them.

It should be noted that while this DAT file is similar to the NAT used by f2fs, it is different in scale. The f2fs NAT is used only to provide indirection when looking up nodes — inodes, index blocks, etc. — not when looking up data blocks. The DAT file is used for lookup of all blocks. This indirection imposes some cost on every access.

Estimating that cost is, again, not easy. Given a 4K block size, each block in the DAT file provides indexing for 128 other blocks. This imposes approximately a 1% overhead in storage space and at least a 1% overhead in throughput. If all the DAT entries for a given file are adjacent, the overhead will be just that 1%. If they are very widely spread out, it could be as much as 100% (if each DAT entry is in a different block of the DAT file). Files that are created quickly on a fresh filesystem will tend to the smaller number, files created slowly (like log files) on a well-aged filesystem will likely tend toward a larger number. An average of 3% or 4% probably wouldn't be very surprising, but that is little more than a wild guess.

Against this cost we must weigh the benefit, which is high frequency snapshots. While I have no experience with this feature, I do have experience with the "undo" feature in text editors. In my younger days I used "ed" and don't recall being upset by the lack of an undo feature. Today I use emacs and use undo all the time — I don't know that I could go back to using an editor without this simple feature. I suspect continual snapshots are similar. I don't miss what I don't have, but I could quickly get used to them.

So: is the availability of filesystem-undo worth a few percent in performance? This is a question I'll have to leave to my readers to sort out. To make it easier to ponder, I'll relieve your curiosity and clarify why not all files use the DAT layer of indirection. The answer is of course that the DAT file itself cannot use indirection as it would then be impossible to find anything. Every other file does use the DAT and every lookup in those files will involve indirection.

Other miscellaneous metadata

NILFS2 has two other metadata files. Both of these are simple tables of data without the allocation bitmaps of the DAT file and the ifile.

The "sufile" records the usage of each segment of storage, counting the number of active blocks in the segment, and remembering when the segment was written. The former is used to allow the segment to be reused when it reaches zero. The latter is used to guide the choice of which segment to clean next. If a segment is very old, there is not much point waiting for more blocks to die of natural attrition. If it is young, it probably is worthwhile to wait a bit longer.

The "cpfile" records checkpoints and every time a checkpoint is created a new record is added to the end of this file. This record stores enough information to reconstruct the state of all files at the time of the checkpoint. In particular, this includes the inode for the ifile. Left to itself, this file would grow without bound. However, in normal operation a user-space program (nilfs_cleanerd) will monitor usage and delete old checkpoints as necessary. This results in the cpfile becoming a sparse file with lots of empty space for most of the length of the file, a dense collection of records at the end, and an arbitrary number of individual blocks sprinkled throughout (for the long-term snapshots). This is the file for which a radix-tree index may not be the optimal indexing scheme. It isn't clear that would matter though.

Pros and cons

So now we must return to the question: from a technical perspective, would NILFS2 make a suitable base of a flash-optimized filesystem? The principal property of flash, that the best writes are sequential writes aligned to the underlying erase block size, is easily met by NILFS2, making it a better contender than filesystems with lots of fixed locations, but can we gain any insight by looking at the details?

One of the several flash-focused features of f2fs is that it has several segments open for writing at a time. This allows data with different life expectancies to be kept separate, and also improves the utilization of those flash devices that allow a degree of parallelism in access. NILFS2 only has a single segment open at a time, as is probably appropriate for rotating media with a high seek cost, and makes no effort to sort blocks based on their life expectancy. Adding these to NILFS2 would be possible, but it is unlikely that it would be straightforward.

Looking at the more generally applicable features of NILFS2, the directory structure doesn't scale, the file indexing is less than optimal, and the addressing indirection imposes a cost of uncertain size. On the whole, there seems to be little to recommend it and a substantial amount of work that would be required to tune it to flash in the way that f2fs has been tuned. It gives the impression of being a one-big-idea filesystem. If you want continual snapshots, this is the filesystem for you. If not, it holds nothing of real interest.

On the other side, f2fs comes across as a one-big-idea filesystem too. It is designed to interface with the FTL (Flash translation layer) found in today's flash devices, and provides little else of real interest, providing no snapshots and not working at all well with any storage device other than flash. Could this be a sign of a trend towards single-focus filesystems? And if so, is that a good thing, or a bad thing?

So, while NILFS2 could have been used as a starting point for a flash-focused filesystem, it is not at all clear that it would have been a good starting point, and it is hard to challenge the decision to create a new filesystem from scratch. Whether some other filesystem might have made a better start will have to be a question for another day.

Comments (18 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 3.7-rc4 ?

Greg KH Linux 3.6.6 ?

Thomas Gleixner 3.6.5-rt14 ?

Thomas Gleixner 3.6.5-rt15 brown paper bag release ?

Greg KH Linux 3.4.18 ?

Steven Rostedt 3.4.17-rt28 ?

Steven Rostedt 3.4.15-rt26 ?

Steven Rostedt 3.2.33-rt50 ?

Steven Rostedt 3.2.32-rt49 ?

Greg KH Linux 3.0.51 ?

Steven Rostedt 3.0.50-rt74 ?

Steven Rostedt 3.0.48-rt72 ?

Architecture-specific

Kees Cook arch/arm: support seccomp ?

Build system

Kees Cook remove CONFIG_EXPERIMENTAL ?

Core kernel code

Steven Rostedt tracing: Using recursion bits for function tracing and ring buffer ?

Tejun Heo cgroup_freezer: implement proper hierarchy support ?

Alex Shi power aware scheduling ?

Development tools

Luming Yu A simple hardware detector for latency as well as throughtput ver 0.10 ?

Stephane Eranian perf: add memory access sampling support ?

Jan Kiszka Add gdb python scripts as kernel debugging helpers ?

Device drivers

Jason Wang Multiqueue support in tuntap ?

Jiang Liu introduce ACPI based system device hotplug driver ?

Philip, Avinash Support for AM33xx PWM Susbsytem ?

Jon Mason PCI-Express Non-Transparent Bridge Support ?

Filesystems and block I/O

Darrick J. Wong [RFC PATCH v2 0/3] mm/fs: Implement faster stable page writes on filesystems ?

Jeff Layton vfs: add the ability to retry lookup and operation to most path-based syscalls ?

Memory management

Wen Congyang memory-hotplug: hot-remove physical memory ?

Glauber Costa kmem controller for memcg. ?

Michel Lespinasse mm: use augmented rbtrees for finding unmapped areas ?

Mel Gorman Foundation for automatic NUMA balancing ?

Srivatsa S. Bhat [RFC PATCH 00/10][Hierarchy] mm: Linux VM Infrastructure to support Memory Power Management ?

Rafael Aquini make balloon pages movable by compaction ?

Virtualization and containers

Sjur Brændeland Introduce CAIF Virtio and reversed Vrings ?

George Zhang VMCI for Linux upstreaming ?

George Zhang VSOCK for Linux upstreaming ?

Page editor: Jonathan Corbet
Next page: Distributions>>