Brief items
The 3.7 kernel is out, released by Linus on December 10. "Anyway, it's
been a somewhat drawn out release despite the 3.7 merge window having
otherwise appeared pretty straightforward, and none of the rc's were
all that big either. But we're done, and this means that the merge
window will close on Christmas eve." Of course, "drawn out" is a
relative term; at 72 days, this cycle is only a few days above average
in length.
Headline features in this kernel include 64-bit ARM support, improved
security with supervisor-mode access prevention, SMB 2.1 support,
server-side TCP fast open support, signed kernel modules, and more.
See the KernelNewbies 3.7 page for details.
Stable updates: 3.0.55 and 3.4.22, containing a build error fix, were
released on December 5. The rather larger 3.2.35 update was released
on December 7. 3.0.56, 3.4.23, and 3.6.10 were released on
December 10. No stable updates are in the review process as of this
writing.
Modern network hardware has often sprouted various "offload"
engines, unfortunately now often enabled by default, which tend to
do more damage than good except for extreme benchmarking fanatics,
often primarily on big server machines in data centers. Start by
turning them off. We'll write more on this topic soon. The
implementors of this "smart" hardware are less "smart" than they
think they are.
— The Bufferbloat project on CoDel benchmarking best practices
When we enter the kernel mode, we start with saving CPU state.
Usually (and you are going to hate that word well before you read
to the end) it's stored in struct pt_regs, but it might be more
complicated. For our purposes it's better to think of it as
abstract saved state, leaving aside the question of how it's
represented.
— Al Viro teaches a class on signal handling
If (maintainer thinks their patch is right) {
patch doesn't need review
} else {
/* maintainer thinks the patch is wrong. */
/* XXX: why would you think your own patch is wrong? */
patch needs review
}
— Dave Chinner
Review is part of the way we work as a community and we should
figure out how to fix our review process so that we can have
meaningful results from the review or we lose confidence in the
process and it makes it much harder to get reviewers to spend time
reviewing when their reviews are ultimately ignored.
— Ric Wheeler
Anybody who claims that our "process" requires that things like
that go on the mailing list and pass long reviews and discussions
IS JUST LYING.
Because it's not true. We discuss big features, and we want code
review, yes, but the fact is, most small obvious patches do *not*
get reviewed, they just get fixed. You all know that, why the hell
are you suddenly claiming that this is so magically different?
— Linus Torvalds
This is why this discussion reminds me so much of the wakelocks
discussion, and why I've made the same decision the Android folks
made, except they wasted far more time and got far more frustrated
--- I'll just keep the damned thing as a out-of-tree patch, until
there are enough other people willing to say that they need and are
using this patch because their workloads and use cases need it. It
will save me a whole lot of time.
— Ted Ts'o
Kernel development news
By Jonathan Corbet
December 12, 2012
The 3.8 merge window looks to be an interesting time. In theory, it closes
just before the Christmas holiday, though Linus has threatened to start his
celebrations early. Despite the possibly shortened window, there are,
according to linux-next maintainer Stephen Rothwell, "more commits in
linux-next than ever before."
So expect to see a lot of changes flowing into the mainline in a relatively
short period of time.
As of this writing, some 3800 of those changes have been merged by
Linus. The most significant user-visible changes include:
- The cpuidle subsystem is now able to
associate different drivers with each CPU. This capability is needed
to support asymmetric architectures like big.LITTLE.
- Linux running as a Microsoft Hyper-V guest now supports memory-use
reduction via the Hyper-V balloon driver.
- Applications accessing huge pages via the mmap() or SYSV IPC
interfaces can now specify which page size they want; a sketch of the
new flag encoding appears after this list.
- The x86 architecture code can, finally, support hotplugging of the
initial boot CPU ("CPU0").
- On the other hand, as discussed at the
2012 Kernel Summit, support for the venerable 386 architecture has been
removed from the kernel. H. Peter Anvin informed Linus of an important
loss of functionality from this change: "Unfortunately there's a
nostalgic cost: your old original 386 DX33 system from early 1991
won't be able to boot modern Linux kernels anymore." Linus was
unmoved, though, and merged the change.
- The XFS filesystem has gained a new verification mechanism that can
detect corrupted data read from the storage device.
- New hardware support includes:
- Processors and systems:
Broadcom BCM281XX SoCs,
Allwinner A1X SoCs,
USI Topkick boards,
ZyXEL NSA-310 boards,
MPL CEC4 boards, and
Samsung EXYNOS5440 SoCs.
Support for SH-Mobile SH7367 and SH7377 CPUs has been removed.
Also removed is support for the Intel PXA2xx/PXA3xx and Marvell
PXA95x architectures on the assumption that nobody will miss them;
anybody who disagrees may want to speak up in the near future.
- Memory-technology devices:
Wondermedia SD/MMC host controllers, and
Realtek PCI-E SD/MMC and Memstick card interfaces.
- Miscellaneous:
Texas Instruments ADS7830 analog-to-digital converters (ADCs),
Texas Instruments ADC081C021/027 ADCs,
Dialog Semiconductor DA9055 ADCs,
Analog Devices AD54xx digital-to-analog converters,
ST Microelectronics SPEAr PLGPIO controllers,
Dialog Semiconductor DA9055 GPIO controllers,
Cirrus Logic CLPS711x/EP721x/EP731x-based GPIO controllers,
Technologic Systems TS-5500 digital I/O controllers,
Exar XR17V35x multi-port PCIe UARTs,
ARC (Synopsys) UARTs,
SystemBase PCI Multiport UARTs,
Commtech Fastcom Async-335 and Fastcom Async-PCIe cards,
ACPI enumerated SDHCI controllers,
Firewire-attached TTY devices,
Analog Devices ADIS16136 gyroscopes, and
Analog Devices ADIS16480 inertial measurement units.
- Thermal: The kernel has a new thermal governor subsystem
capable of responding when the system gets too hot. A driver has
been added for ST-Ericsson DB8500 thermal regulators.
- USB:
Renesas R-Car USB phys.
- Staging graduations: the IndustryPack bus driver,
Maxim max1363 ADC driver,
Analog Devices AD7793 ADC driver, and
Analog Devices AD7298 ADC driver
have moved out of the staging tree. The RealTek PCI-E card reader
driver has been removed from staging since that functionality is
now provided by a separate mainline driver.
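As an illustration of the huge-page change noted above, here is a
minimal user-space sketch. It assumes the MAP_HUGE_SHIFT flag
encoding, under which the base-two logarithm of the desired page size
is placed in the high mmap() flag bits; the helper name is purely
illustrative, and the mapping will fail at run time unless the
administrator has reserved 2MB huge pages.

    /* Sketch only: define the constants in case the C library's
     * headers do not yet carry them. 21 == log2(2MB). */
    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGE_SHIFT
    #define MAP_HUGE_SHIFT 26
    #endif
    #ifndef MAP_HUGE_2MB
    #define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
    #endif

    /* Request an anonymous mapping backed by 2MB huge pages. */
    static void *map_2mb_pages(size_t len)
    {
            return mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                        MAP_HUGE_2MB, -1, 0);
    }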
Changes visible to kernel developers include:
- The __devinit, __devinitdata,
__devinitconst,
__devexit, and __devexit_p() macros are on their way
out; many drivers have been fixed to stop using them. In the future,
the CONFIG_HOTPLUG option will no longer exist, so
initialization and finalization code needs to be kept around forever.
- The power management quality-of-service subsystem can now support
device-specific QOS flags. Two flags have been defined in 3.8:
PM_QOS_FLAG_NO_POWER_OFF and
PM_QOS_FLAG_REMOTE_WAKEUP.
- The devfreq subsystem now supports devices that can be suspended (or
placed into an idle state) independently of the rest of the system.
- The UIO driver subsystem has a new generic platform driver allowing
UIO devices to access memory allocated by CMA or the IOMMU subsystem.
- The per-entity load-tracking patch set has been merged; this code
allows the scheduler to better understand which processes (and control
groups) are putting load on the system, thus improving load balancing
and related decisions.
- The callback-free RCU implementation
has been merged, allowing the offloading of some read-copy-update
overhead from a subset of CPUs in the system.
The 3.8 merge window has just begun; there are a lot of subsystem trees yet
to be pulled into the mainline. LWN will continue to follow the kernel
repository as Linus pulls in more patches and establishes the feature set
for the 3.8 release; stay tuned.
By Michael Kerrisk
December 13, 2012
The results of the user namespaces work on Linux have been a long time in
coming, probably because they are the most complex of the various namespaces that have been
added to the kernel so far. The first pieces of the implementation started
appearing when Linux 2.6.23 (released in late 2007) added the
CLONE_NEWUSER flag for the clone() and unshare() system calls,
although initially the flag did little. By Linux 2.6.29, it had become
meaningful in the clone() system call. However, until now, many of the
pieces necessary for a complete
implementation have remained absent.
We last looked at user
namespaces back in April, when Eric Biederman was working to push a raft of
patches into the kernel with the goal of bringing the implementation closer
to completion. Eric is now engaged in pushing further patches into the
kernel with the goal of having a more or less complete implementation of
user namespaces in Linux 3.8. Thus, it seems to be time to have another look
at this work. First, however, a brief recap of user namespaces is probably
in order.
User namespaces allow per-namespace mappings of user and group IDs. In
the context of containers, this means that
users and groups may have privileges for certain operations inside the
container without having those privileges outside the container. (In other
words, a process's set of capabilities for operations inside a user
namespace can be quite different from its set of capabilities in the
host system.) One of the specific goals of user namespaces is to allow a
process to have root privileges for operations inside the container, while
at the same time being a normal unprivileged process on the wider system
hosting the container.
To support this behavior, each of a process's user IDs has, in effect,
two values: one inside the container and another outside the
container. Similar remarks hold true for group IDs. This duality is
accomplished by maintaining a per-user-namespace mapping of user IDs: each
user namespace has a table that maps user IDs on the host system to
corresponding user IDs in the namespace. This mapping is set and viewed by
writing and reading the /proc/PID/uid_map
pseudo-file, where PID is the process ID of one of the processes in
the user namespace. Thus, for example, user ID 1000 on the host system
might be mapped to user ID 0 inside a namespace; a process with a user ID
of 1000 would thus be a normal user on the host system, but would have root
privileges inside the namespace. If no mapping is provided for a particular
user ID on the host system, then, within the namespace, the user ID is
mapped to the value provided in the file
/proc/sys/kernel/overflowuid (the default value in this file is
65534). Our earlier article went into more details of the
implementation.
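To make those mechanics concrete, here is a minimal sketch (not taken
from the patches; the UID values are illustrative, error handling is
abbreviated, and the child's sleep() is a deliberately crude way of
waiting for the mapping). Run by a host user with UID 1000, the child
should report UID 0; before the mapping is written it would see the
overflow value instead.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static char stack[1024 * 1024];   /* child stack; grows downward */

    static int child(void *arg)
    {
            sleep(1);   /* crude: wait for the parent to write uid_map */
            printf("uid inside namespace: %d\n", (int)getuid());
            return 0;
    }

    int main(void)
    {
            char path[64];
            FILE *f;
            pid_t pid = clone(child, stack + sizeof(stack),
                              CLONE_NEWUSER | SIGCHLD, NULL);

            if (pid == -1) {
                    perror("clone");
                    exit(1);
            }

            /* Each uid_map line is "ID-inside ID-outside length";
             * here, UID 0 in the namespace is host UID 1000. */
            snprintf(path, sizeof(path), "/proc/%d/uid_map", (int)pid);
            f = fopen(path, "w");
            if (f == NULL || fprintf(f, "0 1000 1\n") < 0 || fclose(f)) {
                    perror("uid_map");
                    exit(1);
            }

            waitpid(pid, NULL, 0);
            return 0;
    }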
One further point worth noting is that the description given in the
previous paragraph looks at things from the perspective of a single user
namespace. However, user namespaces can be nested, with user and group ID
mappings applied at each nesting level. This means that a process might
have distinct user and group IDs in each of the nested user namespaces in
which it is a member.
Eric has assembled a number of namespace-related patch sets for
submission in the upcoming 3.8 merge window. Chief among these is the set that completes the main pieces of the
user namespace infrastructure. With the changes in this set,
unprivileged processes can now create new user namespaces (using
clone(CLONE_NEWUSER)). This is safe, says Eric, because:
Now that we have been through every permission check in the kernel
having uid == 0 and gid == 0 in your local user namespace no
longer adds any special privileges.
Even having a full set of caps in your local user namespace is safe because
capabilities are relative to your local user namespace, and do not confer
unexpected privileges.
The point that Eric is making here is that following the work
(described in our earlier article) to implement the kuid_t and
kgid_t types within the kernel, and the conversion of various
calls to capable() to its namespace analog, ns_capable(),
having a user ID of zero inside a user namespace no longer grants special
privileges outside the namespace. (capable() is the kernel
function that checks whether a process has a capability;
ns_capable() checks whether a process has a capability
inside a namespace.)
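In code terms, the conversion pattern looks roughly like the following
sketch; capable() and ns_capable() are the real kernel helpers, while
the surrounding context (a networking object carrying a user_ns
pointer, per the networking patches discussed below) is abbreviated.

    /* Before: the check passes only for processes privileged in the
     * initial user namespace. */
    if (!capable(CAP_NET_ADMIN))
            return -EPERM;

    /* After: the check is made against the user namespace that owns
     * the object being operated on, so a namespace's root user
     * passes, but only for that namespace's objects. */
    if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
            return -EPERM;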
The creator of a new user namespace starts off with a full set of
permitted and effective capabilities within the namespace, regardless of
its user ID or capabilities on the host system. The creating process thus
has root privileges, for the purpose of setting up the environment inside
the namespace in preparation for the creation or the addition of other
processes inside the namespace. Among other things, this means that the
(unprivileged) creator of the user namespace (or indeed any process with
suitable capabilities inside the namespace) can in turn create all other
types of namespaces, such as network, mount, and PID namespaces (those
operations require the CAP_SYS_ADMIN capability). Because the
effect of creating those namespaces is limited to the members of the user
namespace, no damage can be done in the host system.
Other notable user-space changes in Eric's patches include extending the unshare()
system call so that it can be employed to create user namespaces, and extensions that allow a process to use the setns()
system call to enter an existing user namespace.
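The setns() side might be used as in this minimal sketch; the path is
illustrative, and the call will only succeed if the caller has
CAP_SYS_ADMIN in the target namespace.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <unistd.h>

    static int join_user_ns(const char *path) /* e.g. "/proc/1234/ns/user" */
    {
            int fd = open(path, O_RDONLY);
            int ret;

            if (fd == -1)
                    return -1;
            /* The second argument asks setns() to verify that the
             * descriptor really refers to a user namespace. */
            ret = setns(fd, CLONE_NEWUSER);
            close(fd);
            return ret;
    }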
Looking at some of the other patches in the series gives an idea of
just how subtle some of the details are that must be dealt with in order to
create a workable implementation of user namespaces. For example, one of the patches deals with the behavior of
set-user-ID (and set-group-ID) programs. When a set-user-ID program is
executed (via the execve() system call), the effective user ID of
the process is changed to match the user ID of the executable file. When a
process inside a user namespace executes a set-user-ID program, the effect
is to change the process's effective user ID inside the namespace to
whatever value was mapped for the file user ID. Returning to the example
used above, where user ID 1000 on the host system is mapped to user ID 0
inside the namespace, if a process inside the user namespace executes a
set-user-ID program owned by user ID 1000, then the process will assume an
effective user ID of 0 (inside the namespace).
However, what should be done if the file user ID has no mapping inside
the namespace? One possibility would be for the execve() call to
fail. However, Eric's patch implements another approach: the set-user-ID
bit is ignored in this case, so that the new program is executed, but the
process's effective user ID is left unchanged. Eric's reasoning is that
this mirrors the semantics of executing a set-user-ID program that resides
on a filesystem that was mounted with the MS_NOSUID flag. Those
semantics have been in place since Linux 2.4, so the kernel code paths
for this behavior should be well tested.
Another notable piece of work in Eric's patch set concerns the files in
the /proc/PID/ns directory. This directory
contains one file for each type of namespace of which the process is a
member (thus, for each process, there are the files ipc, mnt,
net, pid, user, and uts). These files
already serve a couple of purposes. Passing an open file descriptor for one
of these files to setns() allows a process to join an existing
namespace. Holding an open file descriptor for one of these files, or bind
mounting one of the files to some other location in the filesystem, will
keep a namespace alive even if all current members of the namespace
terminate. Among other things, the latter feature allows the piecemeal
construction of the contents of a container. With this patch in Eric's recent series, a single
/proc inode is now created per namespace, and the
/proc/PID/ns files are instead implemented as
special symbolic links that refer to that inode. The practical upshot is
that if two processes are in, say, the same user namespace, then calling
stat() on the respective
/proc/PID/ns/user files will return the same inode
numbers (in the st_ino field of the returned stat
structure). This provides a mechanism for discovering if two processes are
in the same namespace, a long-requested feature.
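A user-space membership test would then look something like this
sketch (the function name is illustrative; the device numbers are
compared along with the inode numbers for good measure):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    /* Returns 1 if PIDs a and b share a user namespace, 0 if not,
     * -1 on error (no such process, or no kernel support). */
    static int same_user_ns(pid_t a, pid_t b)
    {
            char pa[64], pb[64];
            struct stat sa, sb;

            snprintf(pa, sizeof(pa), "/proc/%d/ns/user", (int)a);
            snprintf(pb, sizeof(pb), "/proc/%d/ns/user", (int)b);
            if (stat(pa, &sa) == -1 || stat(pb, &sb) == -1)
                    return -1;
            return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
    }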
This article has covered just the patch set to complete the user
namespace implementation. However, at the same time, Eric is pushing a
number of related patch sets towards the mainline, including: changes to the networking stack so that user
namespace root users can create network namespaces; enhancements and clean-ups of the PID namespace
code that, among other things, add unshare() and
setns() support for PID namespaces; enhancements to the mount namespace code that
allow user namespace root users to call chroot() and to create and
manipulate mount namespaces; and a series of
patches that add support for user namespaces to a number of file
systems that do not yet provide that support.
It's worth emphasizing one of the points that Eric noted
in a documentation patch for the user
namespace work, and elaborated on in a private mail. Beyond the
practicalities of supporting containers, there is another significant
driving force behind the user namespaces work: to free the UNIX/Linux API
of the "handcuffs" imposed by set-user-ID and set-group-ID programs. Many
of the user-space APIs provided by the kernel are root-only simply to
prevent the possibility of accidentally or maliciously distorting the
run-time environment of privileged programs, with the effect that those
programs are confused into doing something that they were not designed to
do. By limiting the effect of root privileges to a user namespace, and
allowing unprivileged users to create user namespaces, it now becomes
possible to give non-root programs access to interesting functionality that
was formerly limited to the root user.
There have been a few Acked-by: mails sent in response to
Eric's patches, and a few small questions, but the patches have otherwise
passed largely without comment, and no one has raised objections. It seems
likely that this is because the patches have been around in one form or
another for a considerable period, and Eric has gone to considerable effort
to address objections that were raised earlier during the user namespaces
work. Thus, it seems that there's a good chance that Eric's pull request to have the patches merged in the
currently open 3.8 merge window will be successful, and that a complete
implementation of user namespaces is now very close to reality.
December 11, 2012
This article was contributed by Neil Brown
When thinking about filesystems for modern flash storage devices, as
we have recently done with
f2fs and
NILFS2, two other filesystems that are likely
to quickly spring
to mind, and be almost as quickly discarded, are JFFS2 and UBIFS.
They spring to mind because they were designed specifically to work
with flash, and are discarded because they require access to "raw
flash" whereas the flash devices we have been considering have a
"flash translation layer" (FTL) which hides some of the details of the
flash device and which needs to be accessed much like a disk drive.
This quick discarding may well not be appropriate — these are open-source
filesystems after all and are thus free to be tinkered with. If the
Apollo 13 technicians were able to link the lithium hydroxide
canisters from the command module to the CO₂ scrubber in the lunar
module, it shouldn't be too hard for us to link a raw-flash filesystem
to an FTL-based storage chip — if it seemed like a useful thing to do.
Raw access to a flash device goes through the
"mtd" (Memory Technology
Devices) interface in Linux and, while this is a rich interface,
the vast majority of accesses from a filesystem are via three functions:
mtd_read(),
mtd_write() and mtd_erase(). The first two are easily
implemented by a block device — though you need to allow for the fact
that the mtd interface is synchronous while the block layer interface
is asynchronous — and the last can be largely ignored as an FTL
handles erasure internally. In fact, Linux provides a "block2mtd" device
which will present an arbitrary block device as an mtd device. Using this
might not be the most efficient way to run a filesystem on new
hardware, but it would at least work as a proof-of-concept.
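For reference, the three calls look like this, lightly paraphrased
from the declarations in include/linux/mtd/mtd.h as of 3.7 (exact
types may differ):

    int mtd_read(struct mtd_info *mtd, loff_t from, size_t len,
                 size_t *retlen, unsigned char *buf);
    int mtd_write(struct mtd_info *mtd, loff_t to, size_t len,
                  size_t *retlen, const unsigned char *buf);
    int mtd_erase(struct mtd_info *mtd, struct erase_info *instr);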
So it seems that there could be some possibility of using one of these
filesystems, possibly with a little modification, on an FTL-based flash
device, and there could certainly be value in understanding them a
little better as, at the very least, they could have lessons to teach
us.
A common baseline
Despite their separate code bases, there is a lot of similarity
between JFFS2 and UBIFS — enough that it seems likely that the
latter was developed in part to overcome the shortcomings of the
former. One similarity is that, unlike the other filesystems we have
looked at, neither of these filesystems has a strong concept of a "basic block
size". The concept is there if you look for it, but it isn't
prominent.
One of the main uses of a block size in a filesystem is to manage free
space. Some blocks are in use, others are free. If a block is only
partially used — for example if it contains the last little bit of a
file — then the whole block is considered to be in use. For flash
filesystems, blocks are not as useful for free-space management as
this space is managed in
terms of "erase blocks," which are much larger than the basic blocks of
other filesystems, possibly as large as a few megabytes.
Another use of blocks in a filesystem is as a unit of metadata management.
For example, NILFS2 manages
the ifile (inode file) as a sequence of blocks (rather than
just a sequence of inodes), while F2FS manages each directory as a set
of hash tables, each of which contains a fixed number of blocks.
JFFS2 and UBIFS don't take this approach at all. All data
is written consecutively to one or more erase blocks
with some padding to align things to four-byte boundaries, but with no
alignment so large that it could be called a block. When indexing of
data is needed, an erase-block number combined with a byte offset meets
the need, so the lack of alignment does not cause an issue there.
Both filesystems further make use of this freedom in space allocation
by compressing the data before it is written. Various compression
schemes are available including LZO and ZLIB together with some
simpler schemes like run-length encoding. Which scheme is chosen
depends on the desired trade-off between space saving and execution
time. This compression can make a small flash device hold nearly
twice as much as you might expect, depending on the compressibility of
the files of course. Your author still recalls the pleasant surprise
he got when he found out how much data would fit on the JFFS2
formatted 256MB flash in the original
Openmoko Freerunner: a
reasonably complete Debian root filesystem with assorted development
tools and basic applications still left room for a modest amount of
music and some OSM map tiles.
In each case, the data and metadata of the filesystem are collected
into "nodes" which are concatenated and written out to a fresh erase
block. Each node records the type of data (inode, file, directory
name, etc.), the address of the data (such as an inode number), the type of
compression and a few other details.
This makes it possible to identify the contents of the
flash when mounting and when cleaning, and effectively replaces the
"segment summary" that is found in f2fs and NILFS2.
Special note should be made of the directory name nodes. While the
other filesystems we have studied store a directory much like a file,
with filenames stored at various locations in that file, these two
filesystems do not. Each entry in the directory is stored in its own
node, and these nodes do not correspond to any particular location in
a "file" — they are simply unique entries. JFFS2 and UBIFS each have
their own particular way of finding these names as we shall see, but
in neither case is the concept of a file offset part of that.
The one place where a block size is still visible in these filesystems
is in the way they chop a file up into nodes for storage. In JFFS2, a
node can be of any size up to 4KB so a log file could, for example, be
split up as one node per line. However the current implementation
always writes whole pages — to quote the in-line commentary,
"It sucks, but it's simple".
For UBIFS, data nodes must start at a 4KB-aligned
offset in the file so they are typically 4KB in size (before
compression) except when at the end of the file.
JFFS2 — the journaling flash filesystem
A traditional journaling filesystem, such as ext3 or xfs, adds a
journal to a regular filesystem. Updates are written first to the
journal and then to the main filesystem. When mounting the filesystem
after a shutdown, the journal is scanned and anything that is found is
merged into the main filesystem, thus providing crash tolerance.
JFFS2 takes a similar approach with one important difference — there
is no "regular filesystem". With JFFS2 there is only a journal, a
journal that potentially covers the entire device.
It is probably a little misleading to describe JFFS2 as "just one
journal". This is because it might lead you to think that when it
gets to the end of the journal it just starts again at the beginning.
While this was true of JFFS1, it is not for JFFS2.
Rather it might be clearer to think of each erase block as a little
journal. When one erase block is full, JFFS2 looks around for another
one to use. Meanwhile if it notices that some erase blocks are nearly
empty it will move all the active nodes out of them into a clean erase
block, and then erase and re-use those newly-cleaned erase blocks.
When a JFFS2 filesystem is mounted, all of these journals, and thus
the entire device, are scanned and every node found is incorporated
into an in-memory data structure describing the filesystem. Some
nodes might invalidate other nodes; this may happen when a file is
created and then removed: there will be a node recording the new
filename as belonging to some directory, and then another node
recording that the filename has been deleted. JFFS2 resolves all
these modifications and ends up with a data structure that describes
the filesystem as it was that last time something was written to it,
and also describes where the free space is. The structure is kept as
compact as possible and naturally does not contain any file data; instead,
it holds
only the addresses where the data should be found and so, while it
will be much smaller than the whole filesystem, it will still grow
linearly as the filesystem grows.
This need to scan the entire device at mount time and
store the skeleton of the filesystem in memory puts a limit on the
size of filesystem that JFFS2 is usable for. Some tens of megabytes,
or even a few hundred megabytes, is quite practical. Once the device
gets close to, or exceeds, a gigabyte, JFFS2 becomes quite impractical.
Even if memory for storing the tree were cheap, time to mount the
filesystem is not.
This is where UBIFS comes in. While the details are quite different,
UBIFS is a lot like JFFS2 with two additions: a tree to index all the
nodes in the filesystem, and another tree to keep track of free
space. With these two trees, UBIFS avoids both the need to scan the entire
device at mount time and the need to keep a skeleton of the
filesystem in memory at all times. This allows UBIFS to scale to much
larger filesystems — certainly many tens of gigabytes and probably more.
But before we look too closely at these trees it will serve us well to
look at some of the other details and in particular at "UBI", a layer
between the MTD flash interface layer and UBIFS. UBI uses an unsorted
collection of flash erase blocks to present a number of file system
images; UBI stands for Unsorted Block Images.
UBI — almost a Flash Translation Layer
The
documentation
for UBI explicitly states that it is not a flash translation
layer. Nonetheless it shares a lot of functionality with an FTL,
particularly wear leveling and error management. If you imagined UBI
as an FTL where the block size was the same as the size of an erase
block, you wouldn't go far wrong.
UBI uses a flash device which contains a large number of Physical
Erase Blocks (PEBs) to provide one or more virtual devices (or "volumes")
which each
consist of a smaller number of Logical Erase Blocks (LEBs),
each slightly smaller than a PEB. It maintains a mapping from LEB to
PEB and this mapping may change from time to time due to various
causes including:
-
Writing to an LEB. When an LEB is written,
the data will be written to a new, empty, PEB and the mapping
from LEB to PEB will be updated. UBI is then free to erase the old PEB at its
leisure. Normally, the first new write to an LEB will make all the data
previously there inaccessible. However, a feature is available where
the new PEB isn't committed until the write request completes. This
ensures that after a sudden power outage, the LEB will either have the
old data or the complete new data, never anything else. (A toy sketch
of this remap-on-write scheme appears after this list.)
-
Wear leveling. UBI keeps a header at the start of each PEB which is
rewritten immediately after the block is erased. One detail in the
header is how many times the PEB has been written and erased. When UBI notices
that the difference between the highest write count and the lowest
write count in all the PEBs gets too high (based on a compile-time
configuration parameter: MTD_UBI_WL_THRESHOLD), it will move
an LEB stored in a PEB with a low write count (which is assumed to be
stable since the PEB containing it has not been rewritten often) to one
with a high write
count. If this data continues to be as stable as it has been, this will
tend to reduce the variation among write counts and achieve wear
leveling.
-
Scrubbing. NAND flash includes an error-correcting code (ECC) for
each page (or sub-page) which can detect multiple-bit errors and correct single-bit
errors. When an error is reported while reading from a PEB, UBI will
relocate the LEB in that PEB to another PEB so as to guard against a
second bit error, which would be uncorrectable. This process happens
transparently and is referred to as "scrubbing".
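The remap-on-write behavior in the first item above can be sketched
as a toy model; this is not the real UBI code, and peb_write(),
pick_free_peb(), and schedule_erase() are hypothetical helpers
standing in for the real flash operations. The atomicity comes from
updating the map only after the new copy is safely written.

    #include <stddef.h>

    int peb_write(int peb, const void *buf, size_t len);
    int pick_free_peb(void);         /* an erased, low-wear-count PEB */
    void schedule_erase(int peb);

    struct toy_ubi {
            int map[128];            /* LEB number -> PEB number */
    };

    int leb_change(struct toy_ubi *ubi, int leb, const void *buf,
                   size_t len)
    {
            int old_peb = ubi->map[leb];
            int new_peb = pick_free_peb();

            if (peb_write(new_peb, buf, len) < 0)
                    return -1;       /* old data remains intact */

            ubi->map[leb] = new_peb; /* commit the new copy */
            schedule_erase(old_peb); /* reclaim at leisure */
            return 0;
    }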
The functionality described above is already an advance on the flash
support that
JFFS2 provides. JFFS2 does some wear leveling but it is not precise.
It keeps no record of write counts but, instead, decides to relocate an
erase block based on the roll of a die (or, more precisely, the sampling
of a random number). This probably provides some leveling of wear, but
there are no guarantees. JFFS2 also has no provision for scrubbing.
The mapping from PEB to LEB is stored spread out
over all active erase blocks in the flash device. After the PEB header
that records the write
count there is a second header which records the volume identifier and
LEB number of the data stored here. To recover this mapping at mount
time, UBI needs to read the first page or two from every PEB. While
this isn't as slow as reading every byte like JFFS2 has to, it would still
cause mount time to scale linearly with device size — or nearly
linearly as larger devices are likely to have larger erase block
sizes.
Recently this situation has improved. A new
feature known as "fastmap" made its
way into the UBI driver for Linux 3.7. Fastmap stores a recent copy
of the mapping in some erase block together with a list of the several
(up to 256) erase blocks which will be written next, known as the
pool.
The mount process then needs to examine the first 64 PEBs to find a
"super block" which points to the mapping, read the mapping, and then
read the first page of each PEB in the pool to find changes to the
mapping. When the pool is close to exhaustion, a new copy of the
mapping with a new list of pool PEBs is written out.
This is clearly a little more complex, but puts a firm cap
on the mount time and so ensures scalability to much larger devices.
UBIFS — the trees
With UBIFS, all the filesystem content — inodes, data, and directory
entries — is stored in nodes in various arbitrary Logical Erase Blocks,
and the addresses of these blocks are stored in a single B-tree. This is
similar in some ways to reiserfs (originally known as "treefs") and
Btrfs, and contrasts with filesystems like f2fs, NILFS2 and ext3
where inodes, file data, and directory entries are all stored with
quite different indexing structures.
The key for lookup in this B-tree is 64 bits wide, formed from a 32-bit inode
number, a three-bit node type, and a 29-bit offset (for file data)
or hash value (for directory entries). This last field, combined with a 4KB
block size used for indexing, limits the size of the largest file to two
terabytes, probably the smallest limit in the filesystem.
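A simplified sketch of that key packing follows; the constant name is
taken from fs/ubifs/ubifs-media.h, but the helper itself is
illustrative.

    #include <stdint.h>

    #define UBIFS_S_KEY_BLOCK_BITS 29

    /* Pack inode number, 3-bit node type, and 29-bit block number or
     * name hash into the 64-bit lookup key. */
    static uint64_t ubifs_key_sketch(uint32_t inum, uint32_t type,
                                     uint32_t block_or_hash)
    {
            uint32_t lo = inum;
            uint32_t hi = (type << UBIFS_S_KEY_BLOCK_BITS) |
                          (block_or_hash &
                           ((1u << UBIFS_S_KEY_BLOCK_BITS) - 1));

            return ((uint64_t)hi << 32) | lo;
    }

    /* 29 bits of 4KB blocks: 2^29 * 2^12 bytes = 2^41 bytes = 2TB. */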
Nodes in this B-tree are, like other nodes, stored in whichever erase
block happens to be convenient. They are also like other nodes in that they are not
sized to align with any "basic block" size. Rather the size is chosen
based on the fan-out ratio configured for the filesystem. The default
fan-out is eight, meaning that each B-tree node contains eight keys
and eight pointers to other nodes, resulting in a little under 200
bytes per node.
Using small nodes means that fewer bytes need to be written when
updating indexes. On the other hand, there are more levels in the tree so more
reading is likely to be required to find a node. The ideal trade-off
will depend on the relative speeds of reads and writes. For flash
storage that serves reads a lot faster than writes — which is not
uncommon, but seemingly not universal — it is likely that this fan-out
provides a good balance. If not, it is easy to choose a different
fan-out when creating a filesystem.
New nodes in the filesystem do not get included in the indexing B-tree
immediately. Rather, their addresses are written to a journal, to
which a few LEBs are dedicated. When the filesystem is mounted, this
journal is scanned, the nodes are found, and based on the type and
other information in the node header, they are merged into the indexing
tree. This merging also happens periodically while the filesystem is
active, so that the journal can be truncated.
Those nodes that are not yet indexed are sometimes referred to as
"buds" — a term which at first can be somewhat confusing. Fortunately
the UBIFS code is sprinkled with some very good documentation so it
wasn't too hard to discover that "buds" were nodes that would soon be
"leaves" of the B-tree, but weren't yet —
quite an apt botanical joke.
Much like f2fs, UBIFS keeps several erase blocks open for writes at
the same time so that different sorts of data can be kept separate
from each other, which, among other things, can improve cleaning
performance. These open blocks are referred to as different "journal heads".
UBIFS has one "garbage collection" head where the cleaner writes nodes
that it moves — somewhat like the "COLD" sections in f2fs. There is
also a "base" head where inodes, directory entries, and other non-data
nodes are written — a bit like the "NODE" sections in f2fs.
Finally, there are one or more "data" heads
where file data is written, though the current code doesn't appear to
actually allow the "or more" aspect of the design.
The other tree that UBIFS maintains is used for keeping track of free
space or, more precisely, how many active nodes there are in each
erase block. This tree is a radix tree with a fan-out of four. So if
you write the address of a particular LEB in base four (also known as
radix-four), then each digit would correspond to one level in the tree,
and its value indicates which child to follow to get down to the next
level.
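A lookup in such a tree can be sketched as follows (the types and
names are hypothetical); each level consumes one base-four digit, that
is, two bits, of the LEB number.

    #include <stddef.h>

    struct lpt_node {
            struct lpt_node *child[4];
    };

    static struct lpt_node *lpt_lookup(struct lpt_node *root,
                                       unsigned int leb, int levels)
    {
            struct lpt_node *node = root;
            int shift;

            for (shift = 2 * (levels - 1);
                 node != NULL && shift >= 0; shift -= 2)
                    node = node->child[(leb >> shift) & 3];
            return node;
    }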
This tree is stored in a completely separate part of the device with
its own set of logical erase blocks, its own garbage collection, and
consequently its own table of LEB usage counters. This last table
must be small enough to fit in a single erase block and so imposes a
(comfortably large) limit on the filesystem size. Keeping this tree
separate seems like an odd decision, but doubtlessly simplifies the
task of keeping track of device usage. If the node that records the
usage of an LEB were to be stored in that LEB, there would be
additional complexity which this approach avoids.
A transition to FTL?
While JFFS2 clearly has limits, UBIFS seems to be much less limited.
With 32 bits to address erase blocks which, themselves, could
comfortably cover several megabytes, the addressing can scale to petabyte
devices. The B-tree indexing scheme should allow large directories
and large files to work just as well as small ones. The two terabyte
limit on individual files might one day be a limit but that still
seems a long way off. With the recent addition of fastmap for UBI,
UBIFS would seem ready to scale to the biggest flash storage we have
available. But it still requires raw flash access while a lot of flash
devices force all access to pass through a flash translation layer.
Could UBIFS still be useful on those devices?
Given that the UBI layer looks a lot like an FTL it seems reasonable
to wonder if UBI could be modified slightly to talk to a regular block
device instead, allowing it to work with an SD card or similar. Could
this provide useful performance?
Unfortunately such a conversion would be a little bit more than an
afternoon's project. It would require:
- Changing the expectation that all I/O is synchronous. This might
be as simple as waiting immediately after submitting each request,
but it would be better if true multi-threading could be achieved.
Currently, UBIFS disables readahead because it is incompatible with
a synchronous I/O interface.
- Changing the expectation that byte-aligned reads are possible.
UBIFS currently reads from a byte-aligned offset into a buffer,
then decompresses from there. To work with the block layer it
would be better to use a larger buffer that was sector-aligned, and
then understand that the node read in would be found at an offset into that
buffer, not at the beginning.
- Changing the expectation that erased blocks read as all ones.
When mounting a filesystem, UBIFS scans various erase blocks and
assumes anything that isn't 0xFF is valid data. An
FTL-based flash store will not provide that guarantee, so UBIFS would need to
use a different mechanism to reliably detect dead data. This is
not conceptually difficult but could be quite intrusive to the
code.
- Finding some way to achieve the same effect as the atomic LEB
updates that UBI can provide. Again, a well understood problem,
but possibly intrusive to fix.
So without a weekend to spare, that approach cannot be experimented
with. Fortunately there is an alternative.
As mentioned, there already exists a "block2mtd" driver which can be
used to connect UBIFS, via UBI and mtd, to a block device. This driver
is deliberately very simple and consequently quite inefficient. For
example, it handles the mtd_erase() function by writing blocks
full of 0xFF to the device. However, it turns out that it is
only an afternoon's project to modify it to allow for credible testing.
This patch modifies the block2mtd driver to:
- handle mtd_erase() by recording the location of erased blocks in
memory,
- return 0xFF for any read of an erased block, and
- not write out the PEB headers until real data is to be written to
the PEB.
The result of these changes is that the pattern of reads and, more importantly,
writes to the block device will be much the same as the pattern of
reads and writes expected from a more properly modified UBIFS. It is
clearly not useful for real usage as important information is kept in
memory, but it can provide a credible base for performance testing.
The obvious choice of what to test it against is f2fs. Having
examined the internals of both f2fs and UBIFS, we have found substantial
similarity which is hardly surprising as they have both been designed
to work with flash storage. Both write whole erase blocks at a time
where possible, both have several erase blocks "open" at once, and
both make some efforts to collect similar data into the same erase
blocks. There are of course differences though:
UBIFS probably scales better to large directories,
it can compress data being written, and
it does not currently support exporting via NFS, partly because of the
difficulty of providing a stable index for directory entries.
The compression support is probably most interesting. If the CPU is
fast enough, compression might be faster than writing to flash and
this could give UBIFS an edge in speed.
I performed some testing with f2fs and UBIFS; the latter was tested twice,
with and without the use of compression (the non-compression case is marked
below as "NC").
Just for interest's sake I've added NILFS2, ext4 and
Btrfs. None of these are particularly designed for FTL based flash, though
NILFS2 can align writes with the erase blocks and so might perform well.
The results of the last two should be treated very
cautiously. No effort was made to tune them to the device used, and
all the results are based on writing to an empty device. For f2fs,
UBIFS, and NILFS2 we know that they can "clean" the device so they always write to
unused erase blocks. ext4 and Btrfs do not do the same cleaning so it
is quite possible that the performance will degrade on a more "aged"
filesystem. So the real long term values for these filesystems
might be better, and might be worse, than what we see here.
For testing I used a new class 10 16GB microSD card, which claims 10MB/s
throughput and seems to provide close to that for sequential IO. According
to the flashbench
tool, the card appears to have an 8MB erase block size; five erase blocks
can be open at a time, and only the first erase block is optimized for a
FAT-style file allocation table. The kernel used was the openSUSE 3.6.6
kernel with the above-mentioned patch and the
v3 release of f2fs.
The tests performed were very simple. To measure small file performance,
a tar archive of the Linux kernel (v3.7-rc6) was unpacked ten times and then —
after unmounting and remounting — the files were read back in again
and "du" and "rm -r" were timed to check metadata performance. The
"rm -r" test was performed with a warm cache, immediately after the "du -a", which was performed on a cold cache.
The average times in seconds for these operations were:
| | ubifs | ubifs — NC | f2fs | NILFS2 | ext4 | Btrfs |
| Write kernel | 72.4 | 139.9 | 118.4 | 140.0 | 135.5 | 93.6 |
| Read kernel | 72.5 | 129.6 | 175.7 | 95.6 | 108.8 | 121.0 |
| du -s | 9.9 | 8.7 | 48.6 | 4.4 | 4.4 | 13.8 |
| rm -r | 0.48 | 0.45 | 0.36 | 11.0 | 4.9 | 33.6 |
Some observations:
- UBIFS, with compression, is clearly the winner at reading and writing
small files. This test was run on an Intel Core i7 processor running at
1GHz; on
a slower processor, the effect might not be as big. Without
compression, UBIFS is nearly the slowest, which is a little surprising,
but that could be due to the multiple levels that data passes through
(UBI, MTD, block2mtd).
- f2fs is surprisingly poor at simple metadata access (du -s). It is
unlikely that this is due to the format chosen for the filesystem — the
indirection of the Node Address Table is the only aspect of the design that
could possibly cause this slowdown and it could explain at most a factor of two.
This poor performance is probably some simple implementation issue. The number is
stable across the ten runs, so it isn't just a fluke.
- Btrfs is surprisingly fast at writing. The kernel source tree is
about 500MB in size, so this is around 5.5MB/sec, which is well
below what the device can handle but is still faster than anything
else. This presumably reflects the performance-tuning efforts that
the Btrfs team have made.
- "
rm -r" is surprisingly slow for the non-flash-focused
filesystems, particularly Btrfs. The
variance is high too. For ext4, the slowest "rm -r"
took 32.4 seconds, while, for Btrfs, the slowest was 137.8 seconds —
over 2 minutes. This seems to be one area where tuning the design
for flash can be a big win.
So there is little here to really encourage spending that weekend to
make UBIFS work well directly on flash. Except for the compression
advantage, we are unlikely to do much better than f2fs, which can be
used without that weekend of work. We would at least need to see how
compression performs on the processor found in the target device
before focusing too much on it.
As well as small files, I did some even simpler large-file tests. For
this, I wrote and subsequently read two large, already compressed,
files. One was an mp4 file with about one hour of video. The other was an
openSUSE 12.2 install ISO image. Together they total about 6GB. The total
times for each filesystem were:
| | ubifs | ubifs — NC | f2fs | NILFS2 | ext4 | Btrfs |
| write files | 850 | 876 | 838 | 1522 | 696 | 863 |
| read files | 1684 | 1539 | 571 | 574 | 571 | 613 |
The conclusions here are a bit different:
- Now ext4 is a clear winner on writes. It would be very
interesting to work out why. The time translates to about 8.8MB/sec which
is getting close to the theoretical maximum of 10MB/sec.
- Conversely, NILFS2 is a clear loser, taking nearly twice as long as the
other filesystems. Two separate runs showed similar results so it looks
like there is room for some performance tuning here.
- UBIFS is a clear loser on reads. This is probably because nodes
are not aligned to sectors so some extra reading and extra copying
is needed.
- The ability for UBIFS to compress data clearly doesn't help with these
large files. UBIFS did a little better with compression enabled,
suggesting that the files were partly compressible, but it wasn't
enough to come close to f2fs.
In summary, while f2fs appears to have room for improvement in some
aspects of the implementation, there seems little benefit to be
gained from pushing UBIFS into the arena of FTL-based devices. It
will likely remain the best filesystem for raw flash, while f2fs
certainly has some chance of positioning itself as the best filesystem
for FTL-based flash. However, we certainly shouldn't write off ext4 or
Btrfs. As noted earlier, these tests are not expected to give a firm
picture of these two filesystems so we cannot read anything conclusive
from them. However, it appears that both have something to offer, if
only we can find a way to isolate that something.
Patches and updates
Kernel trees
- Linus Torvalds: Linux 3.7 (December 11, 2012)
Page editor: Jonathan Corbet