Brief items
The current development kernel is 3.7-rc4, released on November 4. Linus Torvalds said:
"Perhaps notable just because of the noise it caused in certain
circles, there's the ext4 bitmap journaling fix for the issue that caused
such a ruckus. It's a tiny patch and despite all the noise about it you
couldn't actually trigger the problem unless you were doing crazy things
with special mount options."
Stable status: 3.0.51, 3.4.18, and 3.6.6 were released on November 5. They all
include important fixes. The latter two (3.4.18,
3.6.6) contain the fix for the ext4 corruption
problem that initially looked much worse than it turned out to be.
Comments (none posted)
Who does not love the smell of formal methods first thing in the
morning?
—
Mel Gorman
TCP metrics code says "Yo dawg, I heard you like sizeof, so I did a
sizeof of a sizeof, so you can size your size" Fix from Julian
Anastasov.
—
David Miller (Thanks to Benjamin Poirier)
Comments (none posted)
Kernel development news
By Michael Kerrisk
November 5, 2012
The volatile ranges feature provides applications that cache large
amounts of data that they can (relatively easily) re-create—for
example, browsers caching web content—with a mechanism to assist the
kernel in making decisions about which pages can be discarded from memory
when the system is under memory pressure. An application that wants to
assist the kernel in this way does so by informing the kernel that a range
of pages in its address space can be discarded at any time, if the kernel
needs to reclaim memory. If the application later decides that it would
like to use those pages, it can request the kernel to mark the pages
nonvolatile. The kernel will honor that request if the pages have not yet
been discarded. However, if the pages have already been discarded, the
request will return an error, and it is then the application's
responsibility to re-create those pages with suitable data.
Volatile ranges, take 12
John Stultz first proposed patches to implement volatile ranges in November
2011. As we wrote then, the proposed
user-space API was via POSIX_FADV_VOLATILE and
POSIX_FADV_NONVOLATILE operations for the posix_fadvise()
system call. Since then, it appears that he has submitted at least another
eleven iterations of his patch, incorporating feedback and new ideas into
each iteration. Along the way, some feedback from David Chinner caused John to
revise the API, so that a later patch
series instead used the fallocate() system call, with two new
flags, FALLOCATE_FL_MARK_VOLATILE and
FALLOCATE_FL_UNMARK_VOLATILE.
The volatile ranges patches were also the subject of a discussion at the 2012 Linux Kernel Summit
memcg/mm minisummit. What became clear there was that few of the memory
management developers are familiar with John's patch set, and he
appealed for more review of his work, since there were some implementation
decisions that he didn't feel sufficiently confident to make on his own.
As ever, getting sufficient review of patches is a challenge, and the
various iterations of John's patches are a good case in point: several
iterations of his patches received no or little substantive feedback.
Following the memcg/mm minisummit, John submitted a new round of
patches, in an attempt to move this work further forward. His latest patch set begins with a lengthy
discussion of the implementation and outlines a number of open questions.
The general design of the API is largely unchanged, with one notable
exception. During the memcg/mm minisummit, John noted that repeatedly
marking pages volatile and nonvolatile could be expensive, and was
interested in ideas about how the kernel could do this more efficiently.
Instead, Taras Glek (a Firefox developer) and others suggested an
idea that could side-step the question of how to more efficiently implement
the kernel operations: if a process attempts to access a volatile page that
has been discarded from memory, then the kernel could generate a
SIGBUS signal for the process. This would allow a process that
wants to briefly access a volatile page to avoid the expense of bracketing
the access with calls to unmark the page as volatile and then mark it as
volatile once more.
Instead, the process would access the data, and if it received a
SIGBUS signal, it would know that the data at the corresponding
address needs to be re-created. The SIGBUS signal handler can
obtain the address of the memory access that generated the signal via one
of its arguments. Given that information, the signal handler can notify the
application that the corresponding address range must be unmarked as
volatile and repopulated with data. Of course, an application that doesn't
want to deal with signals can still use the more expensive
unmark/access/mark approach.
There are still a number of open questions regarding the API. As noted
above, following Dave Chinner's feedback, John revised the interface to use
the fallocate() system call instead of posix_fadvise(),
only to have it suggested by other memory management maintainers at the
memcg/mm minisummit that posix_fadvise() or madvise()
would be better. The latest implementation still uses fallocate(),
though John thinks his original approach of using posix_fadvise()
is slightly more sensible. In any case, he is still seeking further input
about the preferred interface.
The volatile ranges patches currently only support mappings on tmpfs filesystems, and
marking or unmarking a range volatile requires the use of the file
descriptor corresponding to the mapping. In his mail, John explained that
Taras finds the file-descriptor-based interface rather cumbersome:
In talking with Taras Glek, he pointed out that for his needs, the fd based
interface is a little annoying, as it requires having to get access to
tmpfs file and mmap it in, then instead of just referencing a pointer to
the data he wants to mark volatile, he has to calculate the offset from
start of the mmap and pass those file offsets to the interface.
John acknowledged that an madvise() interface would be nice,
but it raises some complexities. The problem with an madvise()
interface is that it could be more generally applied to any part of the
process's address space. However, John wondered what semantics could be
attached to volatile ranges that are applied to anonymous mappings (e.g.,
when pages are duplicated via copy-on-write, should the duplicated page also
be marked volatile?) or
file-mappings on non-tmpfs filesystems. Therefore, the latest
patch series provides only the file-descriptor-based interface.
There are a number of other subtle implementation details that John has
considered in the volatile ranges implementation. For example, if a large
page range is marked volatile, should the kernel perform a partial discard
of pages in the range when under memory pressure, or discard the entire
range? In John's estimation, discarding a subset of the range probably
destroys the "value" of the entire range. So the approach taken is to
discard volatile ranges in their entirety.
Then there is the question of how to treat volatile ranges that overlap
and volatile ranges that are contiguous. Overlapping ranges are coalesced
into a single range (which means they will be discarded as a
unit). Contiguous ranges are slightly different. The current behavior is to
merge them if neither range has yet been discarded. John notes that
coalescing in these circumstances may not be desirable: since the
application marked the ranges volatile in separate operations, it may not
necessarily wish to see both ranges discarded together.
But at this point a seeming oddity of the current implementation
intervenes: the volatile ranges implementation deals with address ranges at
a byte level of granularity rather than at the page level. It is possible
to mark (say) a page and half as volatile. The kernel will only discard
complete volatile pages, but, if a set of contiguous sub-page ranges
covering an entire page is marked volatile, then coalescing the contiguous
ranges allows the page to be discarded if necessary. In response to this
and various other points in John's lengthy mail, Neil Brown wondered if John was:
trying to please everyone and risked pleasing no-one… For example,
allowing sub-page volatile region seems to be above and beyond the call of
duty. You cannot mmap sub-pages, so why should they be volatile?
John responded that it seemed sensible from a user-space point of view
to allow sub-page marking and it was not too complex to implement. However,
the use case for byte-granularity volatile ranges is not obvious from the
discussion. Given that the goal of volatile ranges is to assist the kernel
in freeing up what would presumably be a significant amount of memory when
the system is under memory pressure, it seems unlikely that a process would
make multiple system calls to mark many small regions of memory volatile.
Neil also questioned the use of signals as a mechanism for informing
user space that a volatile range has been discarded. The problem with
signals, of course, is that their asynchronous nature means that they can
be difficult to deal with in user-space applications. Applications that
handle signals incorrectly can be prone to subtle race errors, and signals
do not mesh well with some other parts of the user-space API, such as POSIX
threads. John replied:
Initially I didn't like the idea, but have warmed considerably to it.
Mainly due to the concern that the constant unmark/access/mark pattern
would be too much overhead, and having a lazy method will be much nicer
for performance. But yes, at the cost of additional complexity of
handling the signal, marking the faulted address range as non-volatile,
restoring the data and continuing.
There are a number of other unresolved implementation decisions
concerning the order in which volatile range pages should be discarded when
the system is under memory pressure, and John is looking for input on those
decisions.
A good heuristic is required for choosing which ranges to discard
first. The complicating factor here is that a volatile page range may
contain both frequently and rarely accessed data. Thus, using the least
recently used page in a range as a metric in the decision about whether to
discard a range could cause quite recently used pages to be discarded. The
Android ashmem implementation (upon which John's volatile ranges work
is based) employed an approach to this problem that works well for Android:
volatile ranges are discarded in the order in which they are marked
volatile, and, since applications are not supposed to touch volatile pages,
the least-recently-marked-volatile order provides a reasonable
approximation of least-recently-used order.
But the SIGBUS semantics described above mean that an
application could continue to access a memory region after marking it as
volatile. Thus, the Android approach is not valid for John's volatile range
implementation. In theory, the best solution might be to evaluate the age
of the most recently used page in each range and then discard the range
with the oldest most recently used page; John suspects, however, that there
may be no efficient way of performing that calculation.
Then there is the question of the relative order of discarding for
volatile and nonvolatile pages. Initially, John had thought that volatile
ranges should be discarded in preference to any other pages on the system,
since applications have made a clear statement that they can recover if the
pages are lost. However, at the memcg/mm minisummit, it was pointed out
that there may be pages on the system that are even better candidates for
discarding, such as pages containing streamed data that is unlikely to be
used again soon (if at all). However, the question of how to derive good
heuristics for deciding the best relative order of volatile pages versus
various kinds of nonvolatile pages remains unresolved.
One other issue concerns NUMA-systems. John's latest patch set uses a
shrinker-based approach to discarding pages, which allows for an efficient
implementation. However, (as was discussed at
the memcg/mm minisummit) shrinkers are not currently NUMA-aware. As a
result, when one node on a multi-node system is under memory pressure,
volatile ranges on another node might be discarded, which would throw data
away without relieving memory pressure on the node where that pressure is
felt. This issue remains unresolved, although some ideas have been put
forward about possible solutions.
Volatile anonymous ranges
In the thread discussing John's patch set, Minchan Kim raised a somewhat different use case that has
some similar requirements. Whereas John's volatile ranges feature operates
only on tmpfs mappings and requires the use of a file
descriptor-based API, Minchan expressed a preference for an
madvise() interface that could operate on anonymous mappings. And
whereas John's patch set employs its own address-range based data structure
for recording volatile ranges, Minchan proposed that volatility could be an
implemented as a new VMA attribute, VM_VOLATILE, and
madvise() would be used to set that attribute. Minchan thinks his
proposal could be useful for user-space memory allocators.
With respect to John's concerns about copy-on-write semantics
for volatile ranges in anonymous pages, Minchan suggested volatile pages
could be discarded so long as all VMAs that share the page have the
VM_VOLATILE attribute. Later in the thread, he said he would soon
try to implement a prototype for his idea.
Minchan proved true to his word, and released a first version of his
prototype, quickly followed by a second
version, where he explained that his RFC patch complements John's work
by introducing:
new madvise behavior MADV_VOLATILE and MADV_NOVOLATILE for anonymous
pages. It's different with John Stultz's version which considers only tmpfs
while this patch considers only anonymous pages so this cannot cover John's
one. If below idea is proved as reasonable, I hope we can unify both
concepts by madvise/fadvise.
Minchan detailed his earlier point about user-space memory allocators
by saying that many allocators call munmap() when
freeing memory that was allocated with mmap(). The
problem is that munmap() is expensive. A series of page table
entries must be cleaned up, and the VMA must be unlinked. By contrast,
madvise(MAD_VOLATILE) only needs to set a flag in the VMA.
However, Andrew Morton raised some
questions about Minchan's use case:
Presumably the userspace allocator will internally manage memory in
large chunks, so the munmap() call frequency will be much lower than the
free() call frequency. So the performance gains from this change might be
very small.
The whole point of the patch is to improve performance, but we have no
evidence that it was successful in doing that! I do think we'll need good
quantitative testing results before proceeding with such a patch, please.
Paul Turner also expressed doubts about
Minchan's rationale, noting that the tcmalloc()
user-space memory allocator uses the madvise(MADV_DONTNEED)
operation when discarding large blocks from free(). That operation
informs the kernel that the pages can be (destructively) discarded from
memory; if the process tries to access the pages again, they will either be
faulted in from the underlying file, for a file mapping, or re-created as
zero-filled pages, for the anonymous mappings that are employed by
user-space allocators. Of course, re-creating the pages zero filled is
normally exactly the desired behavior for a user-space memory allocator. In
addition, MADV_DONTNEED is cheaper than munmap() and has
the further benefit that no system call is required to reallocate the
memory. (The only potential downside is that process address space is not
freed, but this tends not to matter on 64-bit systems.)
Responding to Paul's point, Motohiro Kosaki pointed out that the use of
MADV_DONTNEED in this scenario is sometimes the source of
significant performance problems for the glibc malloc()
implementation. However, he was unsure whether or not Minchan's patch would
improve things.
Minchan acknowledged Andrew's questioning of the performance benefits,
noting that his patch was sent out as a request for comment; he agreed with
the need for performance testing to justify the feature. Elsewhere in the thread, he pointed to some
performance measurements that accompanied a similar patch proposed some
years ago by Rik van Riel; looking at those numbers, Minchan believes that
his patch may provide a valuable optimization. At this stage, he is simply
looking for some feedback about whether his idea warrants some further
investigation. If his MADV_VOLATILE proposal can be shown to yield
benefits, he hopes that his approach can be unified with John's work.
Conclusion
Although various people have expressed an interest in the volatile
ranges feature, its progress towards the mainline kernel has been
slow. That certainly hasn't been for want of effort by John, who has been
steadily refining his well-documented patches and sending them out for
review frequently. How that progress will be affected by Minchan's work
remains to be seen. On the positive side, Minchan—assuming that his
own work yields benefits—would like to see the two approaches
integrated. However, that effort in itself might slow the progress of
volatile ranges toward the mainline.
Given the user-space interest in volatile ranges, one supposes that the
feature will eventually make its way into the kernel. But clearly, John's
work, and eventually also Minchan's complementary work, could do with more
review and input from the memory management developers to reach that goal.
Comments (28 posted)
By Jake Edge
November 7, 2012
The UEFI secure boot "problem" spans multiple levels in a Linux system.
There are the actual "getting Linux to boot" issues, which have mostly been addressed by
the two signed bootloaders that are
available for distributions and users. Beyond that, though, are a set of
kernel features that have the potential to subvert secure boot. Depending
on one's perspective, those features either need to be configurable in the
kernel—so some distributions can turn them off in their signed
kernels—or they pose little risk beyond that of existing (but
unknown) kernel bugs. As might be guessed, both sides of
that argument can be heard in a recent linux-kernel thread.
The root problem, so to speak, is that the root user on Linux systems is
trusted by the kernel. That means root can make all sorts of
temporary—or permanent—changes to the state of the system.
Those changes include things like using kexec() to boot a
different operating system, or writing a hibernation image to the swap
partition for use in a resume. But both of those things could be used by
an attacker to circumvent the protections that secure boot is meant to enforce.
If, for example, a user were to boot using one of the Microsoft-signed
Linux bootloaders—"shim" or "pre-bootloader"—into a kernel that didn't restrict root's powers, that
kernel could arrange to execute a different kernel, perhaps one that has
been compromised. Worse yet, from the perspective of those worried about
Microsoft blacklisting bootloaders or keys, that second kernel could
actually be a malicious version of Windows. So, a secure-boot-protected
system would end up booting mal-Windows, which is precisely the scenario
secure boot is supposed to prevent.
If that occurs in the wild, various folks believe that Microsoft will
blacklist the bootloader that was used in the attack. If that's the same
bootloader used to boot Linux, new systems, as well as old systems that get
the blacklist update, will no longer boot Linux. Matthew Garrett (at
least) is concerned about that scenario, so he has proposed kernel changes that would prevent
suitably configured kernels from using kexec() among a handful of
other restrictions that could be used to circumvent secure boot.
That was back in early September, and those changes were relatively
uncontroversial except for the capability name Garrett chose
(CAP_SECURE_FIRMWARE) and the kexec() restriction. In
mid-September, he followed up with a set of
patches that were substantially the same, though the kexec()
patch was removed and the capability was renamed to
CAP_COMPROMISE_KERNEL. In the patch, he noted that "if
anyone wants
to deploy these then they should disable kexec until support for signed
kexec payloads has been merged".
Things went quiet for more than a month, but have since erupted into a
rather large thread. A query from Jiri
Kosina about loading firmware into a secure boot kernel led to a
discussion of the threat model that is being covered by Garrett's patch
set. While Garrett agreed that firmware
loading should eventually be dealt with via signatures, it is not as high
on his priority list. An automated attack using crafted firmware would be
very hardware-specific, requiring reverse engineering, and "we'd
probably benefit from them doing that in the long run".
Garrett's focus on automated attacks makes it clear what threat models he
is trying to thwart, so Kosina's next query, about resuming from hibernation, is an issue
that Garrett believes should be addressed. It turns out that Josh Boyer
has a patch to
disable hibernation for secure boot systems, but that, like disabling
kexec(), was not overwhelmingly popular.
There are other ways to handle resuming from hibernation, for example by creating keys at boot
time that get stored in UEFI boot variables and that the kernel uses to sign
hibernation images. But it is clear that some kernel developers are
starting (or continuing) to wonder if the kernel secure boot support isn't
going a bit—or far more than a bit—overboard.
For one thing, as James Bottomley pointed
out, there will always be kernel bugs that allow circumvention of these
restrictions (e.g. by root reading the hibernation signing key or flipping
the
capability bit):
[...] a local privilege elevation attack
usually exploits some bug (classically a buffer overflow) which executes
arbitrary code in kernel context. In that case, the same attack vector
can be used to compromise any in-kernel protection mechanism including
turning off the secure boot capability and reading the in-kernel private
signing key.
[...] The point I'm making is that given that the majority of exploits will
already be able to execute arbitrary code in-kernel, there's not much
point trying to consider features like this as attacker prevention. We
should really be focusing on discussing why we'd want to prevent a
legitimate local root from writing to the suspend partition in a secure
boot environment.
But kernel exploits appear to be "off the table", at least in terms of the
secure boot circumvention that Garrett and others are concerned about.
Kosina said:
My understanding is that we are not trying to protect against root
exploiting the kernel. We are trying to protect against root tampering
with the kernel code and data through legitimate use of kernel-provided
[facilities] (/dev/mem, ioperm, reprogramming devices to DMA to arbitrary
memory locations, resuming from hibernation image that has been tampered
with, etc).
It's not exactly clear why Microsoft would make a distinction between a
kernel exploit and using legitimate kernel services when making a
blacklisting
decision, though. But, for distributions that do ship signed kernels, they
can reduce the attack surface substantially: to only those kernels that
they have signed, with whatever vulnerabilities are present in those
particular versions.
Eric Paris detailed one possible attack that
installs a crafted Linux boot environment (with a legitimately signed
bootloader and kernel), which
sleeps after setting up a trojaned Windows hibernation image. Users would
need to wake the machine twice, but would end up running malware in a
secure boot system.
Bottomley and others are, at the very least, uncomfortable with the
idea of an "untrusted root". At the heart of the kernel changes for secure
boot is removing the ability for root to make persistent changes to the boot
environment. The patches that Garrett has proposed close many of the known
holes that would allow root to make those kinds of changes, but the
argument is that there are likely to be others. As Alan Cox put it:
With all the current posted RH patches I can still take over
the box as root trivially enough and you seem to have so far abolished
suspend to disk, kexec and a pile of other useful stuff. To actually lock
it down you'll have to do a ton more of this.
Another possible way to handle Linux being used as an attack vector against
Windows (which is how keys are likely to get blacklisted) is to change the
behavior of the Linux bootloaders. Bottomley suggested that a "present user" test on the
first boot of the bootloader, which could be detected because the UEFI key
database and the "machine owner key" database do not contain the proper
keys, would alleviate the problem. Garrett pointed out that the shim
bootloader does not do this because it needs to be able to boot unattended,
even on first boot. But, Bottomley saw
that as unfortunate:
[...] what I'm telling you is that by deciding to allow automatic
first boot, you're causing the windows attack vector problem. You could
easily do a present user test only on first boot which would eliminate
it. Instead, we get all of this.
Garrett, though, sees unattended first boot as an absolute requirement,
especially
for those who are trying to do automated installations for Linux systems.
Others disagreed, not surprisingly, and the discussion still continues. It
should be noted that the pre-bootloader that Bottomley released does
do a present user test on first boot (and beyond, depending on whether the
user changes the secure boot configuration).
There does seem to be something of whack-a-mole problem here in terms of
finding all of the ways that this "untrusted root" might be able to impact
secure boot. In addition, new kernel features will have to also be
scrutinized to see whether they need to be disabled depending on
CAP_COMPROMISE_KERNEL.
Not trusting root is a very different
model than kernel developers (and users) are accustomed to. One can
imagine that all of the different problem areas will be tracked down
eventually, but it will be a fair amount of work. Whether that work is
truly justified in support of a feature that is largely (though not
completely) only useful for protecting Windows is a big question. On the
other hand, not being able to [easily] boot Linux on x86 hardware because of key
blacklisting would be problematic too. This one will take some time to
play out.
Comments (44 posted)
November 7, 2012
This article was contributed by Neil Brown
A recurring theme in the
comments
on
various
articles
announcing the new
f2fs "flash-friendly"
filesystem was that surely some other filesystem might already meet
the need, or could be adjusted to meet the need, rather than creating
yet another new filesystem.
This is certainly an interesting question, but not one
that is easy to answer. The cost/benefit calculation for creating a
new filesystem versus enhancing an existing one involves weighing many factors,
including motivational, social, political, and even, to some extent,
technical. It is always more fun building something from scratch;
trying to enter and then influence an existing community is at best
unpredictable.
Of the various factors, the only ones for which there is substantial
visibility to the outside observer are the technical factors, so while
they may not be the major factors, they are the ones that this article
will explore. In particular, we will examine "NILFS2", a filesystem
which has been part of Linux for several years and is — superficially
at least — one the best contenders as a suitable filesystem for modest-sized flash storage.
NILFS2 was not written primarily to be a flash-based filesystem, so
comparing it head-to-head on that basis with f2fs (which was) might
not be entirely fair. Instead we will examine it on its own merits,
comparing it with f2fs occasionally to provide useful context, and ask
"could this have made a suitable base for a flash-focused filesystem?"
NILFS2: what is it?
NILFS2 is the second iteration of the
"New Implementation of a Log-structured File System". It is described as a "Continuous Snapshotting"
filesystem, a feature which will be explored in more detail
shortly.
NILFS2 appears to still be under development, with lots of core
functionality present, but a number of important features still missing, such as
extended attributes, quotas, and an fsck tool.
As such, it is in a similar state to f2fs: well worth experimenting
with, but not really ready for production usage yet.
In contrast with f2fs, NILFS2 uses 64 bits for all block
addressing, 96 bits for time stamps (nanoseconds forever!),
but only 16 bits for link counts (would you ever have more
than 65535 links to a file, or sub-directories in one directory?).
F2fs, in its initial release, uses 32 bits for each of these values.
While f2fs is a hybrid LFS
(Log-structured Filesystem), using
update-in-place in a number of cases, NILFS2 is a pure LFS. With the
exception of the superblock (stored twice, once at either end of the
device), everything is written in one continuous log. Data blocks are
added to the log, then indexing information, then inodes, then indexes
for the inodes, and so on. Occasionally a "super root" inode is written, from
which all other blocks in the filesystem can be found. The address of
the latest "super root" is stored in the superblock, along with
static values for various parameters of the filesystem and a couple of
other volatile values such as the number of free blocks.
Whenever a collection of blocks is written to the log, it is preceded
by a segment summary which identifies all the blocks in the segment
(similar to the Segment Summaries of f2fs which are stored in a separate area).
Consecutive segment summaries are linked together so that, in the event of a
crash, all the segment summaries since the most recent super root can be read
and the state of the filesystem can be reconstructed.
The segment size can be chosen to be any number of blocks, which
themselves must have a size that is a power of two up to the page size of
the host
system. The default block size is 4KB and the default device segment
size is 8MB. Segments can easily be made to line up with erase blocks in
a flash device, providing their size is known. While NILFS2 tries to
write whole segments at a time, it is not always possible, so a number
of consecutive partial segments might be written, each with their own
segment summary block.
Being a pure LFS, NILFS2 will never write into the middle of an active
segment — as f2fs does when space is tight. It insists on
"cleaning" partially used segments (copying live data to a new
segment) to make more space available, and does not even keep track of
which particular blocks in a segment might be free. If there are no
clean segments beyond those reserved for cleaning, the filesystem is
considered to be full.
Everything is a file!
The statement "Everything is a file" is part of the Unix tradition
and, like many such statements, it sounds good without tying you down
to meaning very much. Each of "file", "everything" and even "is a"
is open to some interpretation. If we understand "file" to be a
collection of data and index blocks that provide some linearly
addressed storage, "everything" to mean most data and metadata —
excepting only the superblock and the segment summaries — and "is a"
to mean "is stored inside a", then NILFS2 honors this Unix tradition.
For example, in a more traditional filesystem such as ext3, inodes (of
which there is one per file) are stored at fixed locations in the
device — usually a number of locations distributed across the address
space, but fixed nonetheless. In f2fs, a hybrid approach is used where
the addresses of the inodes are stored in fixed locations (in the Node
Address Table — NAT), while the inodes themselves appear in the log,
wherever is convenient. For NILFS2, the inode table is simply
another file, with its own inode which describes the locations of the
blocks.
This file (referred to as the ifile) also contains a bitmap
allowing unused inodes to be found quickly, and a "block group
descriptor" which allows non-empty bitmaps to be found quickly. With
the default block size, every 225 inodes has 1024 blocks
for bitmaps, and one descriptor block, which lists how many bits are
set in each of those bitmaps. If you want more inodes than that, a second
descriptor block will be automatically allocated.
The inodes themselves are a modest 128 bytes in size, and here I must
confess to an oversimplification in the
article on f2fs.
The statement that "Copy-on-write is rather awkward for objects that are
smaller than the block size" holds a grain of truth, but isn't really
true as it stands. The reality is more subtle.
The advantages of a small inode size are primarily in efficiency.
Less space can be wasted, and fewer I/O requests are needed to load the
same number of inodes. For a traditional filesystem with
pre-allocated inode regions, the space wasted can be a significant
issue. However, that does not really apply to an LFS which allocates the
space on demand. The speed issue is slightly harder to reason about.
Certainly if all the inodes for files in one directory live in one
block, then the common task of running "ls -l" will be
expedited. However if more information, such as extended attributes or
file data for small files, is stored in a big inode, accessing that
will require only one block to be read, not two.
The advantages of a block-sized inode — apart from the extra space,
which is of uncertain value — is that inodes can be updated
independently. OCFS2 (a cluster-based filesystem) uses this
to simplify the locking overhead — a cluster node does not need to
gain exclusive access to inodes in the same block as the one that it
is interested in when it performs an update, because there aren't any.
In an LFS, the main issue is reducing cleaning overhead. As we
noted in the f2fs article, grouping data with similar life expectancy
tends to reduce the expected cost of cleaning, so storing an inode
together with the data that it refers to is a good idea. If there are
several inodes in the one block, then the life expectancy will be the
minimum for all the inodes, and so probably quite different from
nearby data. This could impose some (hard to measure) extra cleaning
cost.
On the whole, it would seem best for an LFS if the one-inode-per-block
model were used, as there is minimal cost of wasted space and real
opportunities for benefits. If ways are found to make maximal use of
that extra space, possibly following some ideas that Arnd Bergmann
recently suggested, then block-sized inodes would be even more
convincing.
Small inodes might be seen as a reason not to choose NILFS2,
though not a very
strong reason. Adjusting NILFS2 to allow full-block inodes would not
be a large technical problem, though it is unknown what sort of social
problem it might be.
As with most filesystems, NILFS2 also stores each directory in a
"file" so there are no surprises there. The surprise is that the
format used is extremely simple. NILFS2 directories are nearly
identical to ext2 directories, the only important difference being
that they store a 64-bit inode number rather than just 32 bits. This
means that any lookup requires a linear search through the directory.
For directories up to about 400 entries (assuming fairly short names
on average), this is no different from f2fs. For very large
directories, the search time increases linearly for NILFS2, while it is
logarithmic for f2fs. While f2fs is not particularly efficient at
this task, NILFS2 clearly hasn't made any effort at effective support
for large directories. There appears to be an intention to implement some sort
of B-tree based directory structure in the future, but this has not
yet happened.
Use the right tree for the job
If everything is a file, then it is clearly important to know what a
file is. It starts at the inode which contains space for seven 64-bit
addresses. When the file is small (seven blocks or less) these
contain pointers to all of the allocated blocks. When the file is
larger, this space changes to become the root of a B-tree, with three
keys (file addresses), three pointers to other B-tree nodes, and a
small header.
The interesting thing about this B-tree is that the leaves do not
contain extents describing contiguous ranges of blocks; instead
they describe each block individually. This is interesting because it
does not fit the typical use-case for a
B-tree.
The particular value of a B-tree is that it remains balanced
independently of the ordering or spacing of keys being inserted or removed.
There is a cost that blocks may occasionally need to be split or
merged, but this is more than compensated for by the ability to add an
unpredictable sequence of keys. When extents are being stored in a
tree, it is not possible to predict how long each extent will be, or
when an extent might get broken, so the sequence of keys added will be
unpredictable and a B-tree is ideal.
When the keys being added to the index are the offsets of
consecutive blocks, then the sequence is entirely predictable and a
different tree is likely to be preferred. A radix tree (where the
path through the tree is a simple function of the key value) is much
more compact than a B-tree (as there is no need to store keys) and
much simpler to code. This is the sort of tree chosen for f2fs, the
tree used for ext3, and generally the preferred choice when block
extents are not being used to condense the index.
The only case where a B-tree of individual blocks might be more
efficient than a radix tree is where the file is very sparse, having
just a few allocated blocks spread throughout a large area that is
unallocated. Sparse files are simply not common enough among regular
files to justify optimizing for them. Nor are many of the special files
that NILFS2 uses likely to be sparse. The one exception is the
checkpoint file (described later), and optimizing the indexing strategy
for that one file is unlikely to have been a motivation.
So we might ask "why?". Why does NILFS2 use a B-tree, or why does it not
use extents in its addressing? An early
design document [PDF]
suggests that B-trees were chosen due to their flexibility, and while it isn't
yet clear that the flexibility is worth the cost, future developments might
show otherwise.
The lack of extent addressing can be explained with a much more concrete
answer once we understand one more detail about file indexing.
Another layer of indirection
The headline feature for NILFS2 is "continuous snapshotting". This
means that it takes a snapshot of the state of the filesystem "every
few seconds". These are initially short-term snapshots (also called
"checkpoints"), and can be converted to long-term snapshots, or
purged, by a user-space process following a locally configurable
policy. This means there are very likely to be lots of active
snapshots at any one time.
As has been mentioned, the primary cost of an LFS is cleaning — the
gathering of live data from nearly-empty segments to other segments
so that more free space can be made available. When there is only one
active filesystem, each block moved only requires one index to be updated.
However, when there are tens or hundreds of snapshots, each block can
be active in a fairly arbitrary sub-sequence of these, so relocating a
block could turn into a lot of work in updating indices.
Following the maxim usually attributed to
David
Wheeler,
this is
solved by adding a level of indirection. One of the special files
that NILFS2 uses is known as the "DAT" or "Disk Address Translation"
file. It is primarily an array of 64-bit disk addresses, though the
file also contains allocation bitmaps like the ifile, and each entry
is actually 256 bits as there is also a record of which checkpoints
the block is active in. The addresses in the leaves of the indexing
trees for almost all files are not device addresses but are, instead, indexes
into this array. The value found contains the actual device address.
This allows a block to be relocated by simply moving the block and
updating this file. All snapshots will immediately know where the new
location is. Doing this with variable length extents would be
impractical, which appears to be one of the main reasons that
NILFS2 doesn't use them.
It should be noted that while this DAT file is similar to the NAT used
by f2fs, it is different in scale. The f2fs NAT is used only to
provide indirection when looking up nodes — inodes, index blocks, etc. —
not when looking up data blocks. The DAT file is used for lookup of
all blocks. This indirection imposes some cost on every access.
Estimating that cost is, again, not easy. Given a 4K block size, each
block in the DAT file provides indexing for 128 other blocks. This
imposes approximately a 1% overhead in storage space and at least a 1%
overhead in throughput. If all the DAT entries for a
given file are adjacent, the overhead will be just that 1%. If they
are very widely spread out, it could be as much as 100% (if each DAT
entry is in a different block of the DAT file). Files that are
created quickly on a fresh filesystem will tend to the smaller number,
files created slowly (like log files) on a well-aged filesystem will
likely tend toward a larger number. An average of 3% or 4% probably
wouldn't be very surprising, but that is little more than a wild
guess.
Against this cost we must weigh the benefit, which is high frequency
snapshots. While I have no experience with this feature, I do have
experience with the
"undo" feature in text editors. In my younger days I used "ed" and
don't recall being upset by the lack of an undo feature. Today I use
emacs and use undo all the time — I don't know that I could go back to
using an editor without this simple feature. I suspect continual
snapshots are similar. I don't miss what I don't have, but I could
quickly get used to them.
So: is the availability of filesystem-undo worth a few percent
in performance? This is a question I'll have to leave to my readers
to sort out. To make it easier to ponder, I'll relieve your curiosity and
clarify why not all files use the DAT layer of indirection. The
answer is of course that the DAT file itself cannot use indirection as
it would then be impossible to find anything. Every other file does use
the DAT and every lookup in those files will involve indirection.
Other miscellaneous metadata
NILFS2 has two other metadata files. Both of these are simple tables
of data without the allocation bitmaps of the DAT file and the ifile.
The "sufile" records the usage of each segment of storage, counting the
number of active blocks in the segment, and remembering when the
segment was written. The former is used to allow the segment to be
reused when it reaches zero. The latter is used to guide the choice of
which segment to clean next. If a segment is very old, there is not
much point waiting for more blocks to die of natural attrition. If it
is young, it probably is worthwhile to wait a bit longer.
The "cpfile" records checkpoints and every time a checkpoint is
created a new record is added to the end of this file. This record
stores enough information to reconstruct the state of all files at
the time of the checkpoint. In particular, this includes the inode
for the ifile. Left to itself, this file would grow without bound.
However, in normal operation a user-space program
(nilfs_cleanerd)
will monitor usage and delete old checkpoints as necessary. This
results in the cpfile becoming a sparse file with lots of empty space
for most of the length of the file, a dense collection of records at
the end, and an arbitrary number of individual blocks sprinkled
throughout (for the long-term snapshots). This is the file for which
a radix-tree index may not be the optimal indexing scheme. It isn't
clear that would matter though.
Pros and cons
So now we must return to the question: from a technical perspective,
would NILFS2 make a suitable base of a flash-optimized filesystem?
The principal property of flash, that the best writes are sequential
writes aligned to the underlying erase block size, is easily met by
NILFS2, making it a better contender than filesystems with lots of
fixed locations, but can we gain any insight by looking at the
details?
One of the several flash-focused features of f2fs is that it has
several segments open for writing at a time. This allows data with
different life expectancies to be kept separate, and also improves the
utilization of those flash devices that allow a degree of parallelism
in access. NILFS2 only has a single segment open at a time, as is
probably appropriate for rotating media with a high seek cost, and
makes no effort to sort blocks based on their life expectancy. Adding
these to NILFS2 would be possible, but it is unlikely that it would be
straightforward.
Looking at the more generally applicable features of NILFS2, the
directory structure doesn't scale, the file indexing is less than
optimal, and the addressing indirection imposes a cost of uncertain
size. On the whole, there seems to be little to recommend it and a
substantial amount of work that would be required to tune it to flash
in the way that f2fs has been tuned. It gives the impression of being
a one-big-idea filesystem. If you want continual snapshots, this is
the filesystem for you. If not, it holds nothing of real interest.
On the other side, f2fs comes across as a one-big-idea filesystem too.
It is designed to interface with the FTL (Flash translation layer)
found in today's flash devices, and provides little else of real
interest, providing no snapshots and not working at all well with any
storage device other than flash. Could this be a sign of a trend
towards single-focus filesystems? And if so, is that a good thing, or
a bad thing?
So, while NILFS2 could have been used as a starting point for a
flash-focused filesystem, it is not at all clear that it would have
been a good starting point, and it is hard to challenge the decision to
create a new filesystem from scratch. Whether some other filesystem
might have made a better start will have to be a question for another
day.
Comments (18 posted)
Patches and updates
Kernel trees
Build system
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Architecture-specific
Virtualization and containers
Page editor: Jonathan Corbet
Next page: Distributions>>