Brief items
The current stable 2.6 kernel remains 2.6.30. The 2.6.31 merge window
is open (see below).
The current stable 2.6 kernel update is 2.6.29.5, released on June 15. 2.6.27.25
was released on June 11.
Both have a long list of fixes throughout the kernel.
Comments (none posted)
Kernel development news
A long ago, in days of yore, it all began with a god named Thor.
There were vikings and boats and some plans for a Linux kernel
header. Unfortunately, a single 8-bit field was used for
bootloader type and version.
--
H. Peter
Anvin; vikings never were good at forward compatibility
"I want" isn't a good policy for the introduction of ugly as sin long
term interfaces into the kernel. Besides which you argument doesn't
actually make sense anyway.
--
Alan Cox
This release, I'm only going to pull trees that don't do crazy shit.
--
Linus Torvalds draws a line in the sand
Comments (1 posted)
By Jonathan Corbet
June 17, 2009
The process of merging patches for 2.6.31 has begun. As of this writing,
6220 non-merge changesets have been added to the mainline. Some
of the more
interesting, user-visible change include:
- Reads from /dev/zero can now be interrupted by signals. In
theory, no application should be adversely affected by this change,
but is still a true user-space ABI change.
- The x86 architecture can now handle 46-bit physical addresses,
allowing the use of up to 64TB of physical memory.
- A number of Xen features meant to support Dom0 functionality
(the /dev/xen/evtchn event channel code ant
/sys/hypervisor, for example) have been merged. The Dom0
code itself remains outside, though.
- Support for the rt_tgsigqueueinfo() system call, which sends
a signal (with ancillary information) to a thread group, has been added.
- A number of ftrace features have been added. These include a function
profiler, a number of event tracer improvements, some new tracepoints,
and a new docbook document describing static tracepoints in the kernel.
- The USB TTY port driver has been reworked in ways which bring it into
closer alignment with POSIX behavior and that of other types of serial
ports. Alan Cox fears that some applications which depended on the
old behavior might break - though others, which had problems with USB
serial ports, should now begin to work.
- There is a new sysctl knob
(/proc/sys/kernel/modules_disabled); writing "1" to that file
will cause module loading to be forevermore disallowed.
- The SMACK security module now has a comprehensive logging facility.
- The splice() system call now works when both the input and
output files are pipes.
- The storage topology patches (covered briefly in April)
have been merged. This code allows the kernel to export much more
information about how storage devices are structured, enabling support
for upcoming hardware.
- The performance counters patch set (last covered in December)
have been merged. This code provides a new API for the use of
hardware performance counters; it edges out perfmon and a number of
other implementations in this area.
- The character devices in
user space (CUSE) patch set has been merged.
-
arch-imx has been removed from the ARM architecture
as it has been
superseded by the MXC architecture.
-
The s390 architecture has added support for dynamic ftrace, as well as the
system call and function graph tracers.
-
The packet_mmap changes for the transmission side of packet sockets was
merged, which allows for
more efficient, nearly zero-copy, packet sends from user space.
-
The controller area network (CAN) subsystem has added a network device
driver interface and a netlink interface. New CAN device drivers have also
been
merged using the driver interface (see below).
-
Support for IEEE 802.15.4 sockets has been added to the network subsystem.
This is for low-cost, low-speed "personal area networks".
-
Passive OS fingerprinting has been added to
the netfilter code.
-
The FAT filesystem has added an "errors" mount option which governs its
behavior in the presence of critical errors.
-
The s390 architecture has added support for hibernation.
-
Support has been added for USB 3.0/xHCI host controllers, though none are yet
available.
-
Kernel modesetting for the radeon driver, supporting R1XX, R2XX, R3XX,
R4XX, R5XX, and radeon up to X1950 hardware, has been merged.
- There is the usual pile of new drivers:
- Architectures/processors/systems: SuperH SH7724 processors,
Hitachi SH7724 (SolutionEngine7724) boards,
memory error detection and correction (EDAC) support for AMD K8,
F10h, and F11h processors, ARM PCM043 boards, ARM Atmark
Armadillo 500 boards, OMAP3 Zoom2 boards, OMAP4430 SDP boards
(including SMP support), ARM
MX35PDK boards, Marvell 88F6281 GTW GE boards,Samsung S3C6400 SoCs,
ARMv6/v7 big-endian, Stargate 2 boards, Freescale STMP platforms,
ST-Ericsson U300 series platforms, PowerPC MPC8569MDS boards, PowerPC
P2020DS boards.
- Miscellaneous: Timberdale FPGA UARTs, 64-bit VIA
hardware random number generators, Mediama RMTx add-on board for
ATNGW100, Wacom Graphire Bluetooth tablet, CB710/720 memory card
readers, Maxim 1586 voltage regulators, TI TMP401 temperature sensor,
Intel Langwell USB device controllers, Intel Langwell USB On-the-go
controllers, TI VLYNQ bus, ST Microelectronics DDC I2C interface.
- Networking: Broadcom NetXtremeII gigabit Ethernet cards
(offload features in particular), Intellon PLC (Powerline
communications) USB adapters, Marvell SD8688 wireless chips, Ralink
rt2800 wireless USB chipsets, TI wl1251/wl1271 wireless chipsets, TI
DaVinci Ethernet MACs, Phillips SJA1000 CAN
controllers, Kvaser PCIcanx and Kvaser PCIcan PCI CAN Cards, Intel
wireless Multicomm 3200 devices, Micrel KS8842 ethernet switches.
- Sound: Wolfson Micro WM8988, WM8940, WM9081,
and WM8960 codecs,
Digigram LX6464ES boards,
ESI Maya44 boards,
several Creative Sound Blaster X-FI devices based on the 20K1 and
20K2 chipsets, a USB Audio gadget driver.
- Graphics: AMD r600 chipset, Intel IGDNG chipset.
- Video: DVB-S/S2/DSS Multistandard Professional/Broadcast
demodulators, STV6110/(A) based tuners, ISL6423 SEC controllers, TI
THS7303 video amplifiers, Analog Devices I2C bus based ADV7343
encoders.
Changes visible to kernel developers include:
- There is a new atomic function:
int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);
This function will decrement cnt and, if cnt reaches
zero, it will acquire the given lock.
- A number of block layer request
queue API changes have been merged; all drivers must now dequeue
requests before executing them. Beyond that, the merging of the
storage topology patches (in preparation for 4K-sector disks) mean
that block drivers must now distinguish between the physical block
size on the disk and the logical block size used by the kernel.
- The 32-bit x86 architecture now supports the atomic64_t
type.
- The kernel memory leak
detector has been merged at last.
- The fsnotify backend has been merged. This code provides a new,
common implementation for dnotify and inotify; it also will serve as
the base for the "fanotify" code (formerly TALPA), which has not been
merged as of this writing.
- Btrfs has seen a number of enhancements, including one which involves
an on-disk format change. Existing btrfs filesystems will work with
the new code, but, once mounted on a 2.6.31 system, those filesystems
will no longer work with older kernels.
- Tree read-copy update (RCU) is now the
default, though Classic RCU is still available.
- Changes to the include/asm-generic header files were
merged. These changes are meant to serve as a model for or be used
directly by new architectures
rather than copying from an existing architecture. The S+core (score)
architecture
depends on these changes and the MicroBlaze architecture will be using them
to clean up its ABI.
-
Some rather large cleanups for XFS were merged, including switching to the
generic POSIX ACL code and removal of the xfs_qmops quota ops
function vector.
-
All network drivers have converted to the new net_device_ops API and the
old API available with COMPAT_NET_DEV_OPS has been removed.
-
The rfkill core has been rewritten for devices that implement a way to stop
all radio transmission from the device (in response to a laptop key for
turning off wireless, for example). Various drivers have also been updated
to use the new rfkill API.
-
Debugfs has had all of its references throughout the tree turned into
/sys/kernel/debug/ in both documentation and code. In addition,
LWN's updated guide to debugfs was added to
the Documentation directory.
-
Unicode handling in the kernel has been updated, with functions like
utf_mbstowcs() being renamed to utf8s_to_utf16s() for
better readability.
-
The kmemcheck kernel memory checker to detect the use of uninitialized
memory has been merged.
-
The TTM GPU memory manager (covered a bit
over a year ago) has been merged.
Linus started merging patches on June 10, suggesting that the merge window
will remain open until sometime around June 24. That puts us roughly half
way through the merge window, in terms of time. The merge rate will likely
slow down some for the remainder of the merge window, but there are,
undoubtedly, more interesting changes to come. Stay tuned.
Comments (24 posted)
June 17, 2009
This article was contributed by Valerie Aurora (formerly Henson)
"What ever happened to chunkfs?" This is a question I hear every few
months, and I always answer the same way, "Chunkfs works, the overhead
is reasonable, and it is only practical if it is part of the file
system design from the beginning, not tacked on after the fact. I
just need to write up the paper summarizing all the data." Thanks to
your benevolent LWN editor, you are now reading that paper.
Background
Before we describe chunkfs, we should first talk about the problems
that led to its creation. Chunkfs was motivated by the growing
difficulty of reading and processing all the data necessary to check
an entire file system. As the capacity of storage grows, the capacity
of the rest of the system to read and check that data is not keeping
pace with that growth. As a result, the time to check a "normal" file
system grows, rather than staying constant as it would if the memory,
bandwidth, seek time, etc. of the system grew in proportion to its
storage capacity. These differential rates of growth in hardware -
like the differences between RAM capacity, bandwidth, and latency -
are part of what keeps systems research interesting.
To understand the change in time to check and repair a file system, a
useful analogy is to compare the growth of my library (yes, some
people still own books) with the growth of my ability to read,
organize, and find the books in the library. As a child, my books fit
all on one shelf. I could find the book I wanted in a matter of
seconds and read every single one of my books in a week. As an adult,
my books take up several bookshelves (in addition to collecting on any
flat surface). I can read many times faster than I could when I was a
kid, and organize my books better, but finding a book can take several
minutes, and reading them all would take me a year or more. If I was
trying to find a twenty dollar bill I left in a book, it would take me
several hours to riffle through all the books to find it. Even though
I am a better reader now, my library grew faster than my ability to
read or search my library. Similarly, computers are faster than ever,
but storage capacity grew even faster.
The consequence of this phenomenon that first caught my attention was
the enormous disparity between seek time and capacity of disks. I
calculated a projected change in fsck time given
these predictions
[PDF]
from a storage industry giant in 2006:
|
Improvement in disk performance, 2006 - 2013
|
| Capacity: | 16.0x |
| Bandwidth: | 5.0x |
| Seek time: | 1.2x |
| => at least 10x increase in fsck time! |
Since then, commodity flash-based storage finally became a practical
reality. Solid-state disk (SSDs) have much faster "seeks", somewhat improved bandwidth,
and drastically reduced capacity compared to disks, so the performance
of fsck ought to be extremely good. (I am unable to find any
measurements of fsck time on an SSD, but I would estimate it to be on
the order of seconds. Readers?) However, SSDs don't solve all our
problems. First, SSDs are an element in the cache hierarchy of a
system, layered between system RAM and disk, not a complete
replacement for disks. The sweet spot for SSDs will continue to
expand, but it will be many years or decades before disks are
completely eliminated - look at the history of tape. It's easy for a
laptop user to wave their hands and say, "Disks are obsolete," but try
telling that to Google, your mail server sysadmin, or even your
MythTV box.
Second, remember that the whole reason we care about fsck in the first
place is because file systems get corrupted. One important source of
file system corruption is failures in the storage hardware itself.
Ask any SSD manufacturer about the frequency of corruption on its
hardware and they will inundate you with equations, lab
measurements, and simulations showing that the flash in their SSDs will
never wear out in "normal" use - but they also won't reveal the
details of their wear-leveling algorithms or the workloads they used
to make their predictions. As a result, we get surprises in
real-world use, both in performance and in reliability. When it comes
to failure rates, we don't have good statistics yet, so I have to fall
back on my experience and that of my kernel hacker friends. What we
see in practice is that SSDs corrupt more often and more quickly than
disks, and prefer to devour the tastiest, most crucial parts of the
file system. You can trust the hardware manufacturers when they wave
their hands and tell you not to worry your pretty little head, but I
put more weight on the money I've made consulting for people with
corrupted SSDs.
So, if corruption is a concern, an important benefit of the chunkfs
approach is that file system corruption is usually limited to a small part
of the file system, regardless of the source of the
corruption. Several other repair-driven file system principles - such
as duplicating and checksumming important data - make recovery of data
after corruption much more likely. So if you don't care about fsck
time because you'll be using an SSD for the rest of your life, you
might still read on because you care about getting your data back from
your SSD.
The conclusion I came to is that file systems should be designed with
fast, reliable check and repair as a goal, from the ground up. I
outline some useful methods for achieving this goal in a short
paper: Repair-driven
File System Design [PDF].
Chunkfs design
Chunkfs is the unmusical name for a file system architecture designed
under the assumption that the file system will be corrupted at some
point. Therefore the on-disk format is optimized not only for
run-time performance but also for fast, reliable file system check and
repair. The basic concept behind chunkfs is that a single logical
file system is constructed out of multiple individual file systems
(chunks), each of which can be checked and repaired (fscked)
individually.
This is great and all, but now we have a hundred little file systems,
with the namespace and disk space fragmented amongst them - hardly an
improvement. For example, if we have a 100GB file system divided into
100 1GB chunks, and we want to create a 2GB file, it won't fit in a
single chunk - it has to somehow be shared across multiple chunks. So
how do we glue all the chunks back together again so we can share the
namespace and disk space while preserving the ability to check and
repair each chunk individually? What we want is a way to connect a
file or directory in one chunk and another file or directory in
another chunk in such a way that the connection can be quickly and
easily checked and repaired without a full fsck of the rest of the
chunk. The solution is something we named a "continuation inode". A
continuation inode "continues" a file or directory into another chunk.
You'll notice that there are two arrows in the picture to the left, one
pointing
from the original inode to the continuation inode, and another
pointing back from the continuation inode to the original inode. When
checking the second chunk, you can check the validity of the
continuation inode quickly, by using the back pointer to look up the
original inode in its chunk. This is a check of a "cross-chunk
reference" - any file system metadata that makes a connection between
data in two different chunks. Cross-chunk references must always have
forward and back pointers so that they can be verified starting from
either chunk.
Cross-chunk references must satisfy one other requirement: You must be
able to quickly find all cross-references from chunk A to arbitrary
chunk B, without searching or checking the entire chunk A. To see why
this must be, we'll look at an example. Chunk A has an inode X which
is continued into chunk B. What if chunk A was corrupted in such a
manner that inode X was lost entirely? We would finish checking and
repairing chunk A, and it would be internally self-consistent, but
chunk B would still have a continuation inode for X that is now
impossible to look up or delete - an orphan continuation inode. So as
the second step in checking chunk A, we must then find all references
to chunk A from chunk B (and all the other chunks) and check that they
are in agreement. If we couldn't, we would have to search every other
chunk in the file system to check a single chunk - and that's not
quick.
A chunkfs-style file system can be checked and repaired incrementally
and online, whereas most file systems must be checked all at once
while the file system is offline. Some exceptions to this general
rule do exist, such as the BSD FFS snapshot-based online check, and an
online self-healing mechanism for NTFS, but in general these
facilities are hacked on after the fact and are severely limited in
scope. For example, if the BSD online check actually finds a problem,
the file system must be unmounted and repaired offline, all at once,
in the usual way, including a rerun of the checking stage of fsck.
The chunkfs design requires solutions to several knotty problems we
don't have room to cover in this article, but you can read about in our
2006 Hot Topics in
Dependability paper:
Chunkfs: Using
divide-and-conquer to improve file system reliability and
repair [PDF].
Measurements
Repair-driven file system design, chunkfs in particular, sounds like a
neat idea, but as a file system developer, I've had a lot of neat
ideas that turned out to be impossible to implement, or had too much
overhead to be practical. The questions I needed to answer about
chunkfs were: Can it be done? If it can be done, will the overhead of
continuation inodes outweigh the benefit? In particular, we need to
balance the time it takes to check an individual chunk with the time
it takes to check its connections to other chunks (making sure that
the forward and back pointers of the continuation inodes agree with
their partners in the other chunk). First, let's take a closer look
at file system check time on chunkfs.
The time to check and repair a file system with one damaged chunk is
the sum of two different components, the time to check one chunk
internally and the time to check the cross references to and from this
chunk.
Tfs = Tchunk + Tcross
The time to check one chunk is a function of the size of the chunk,
and the size of the chunk is the total size of the file system divided
by the number of chunks.
Tchunk = f(sizechunk)
sizechunk = sizefs / nchunks
The time to check the cross-chunk references to and from this chunk
depends on the number of those cross references. The exact number of
cross-chunk references will vary, but in general larger chunks will
have fewer cross references and smaller chunks will have more - that
is, cross chunk check time will grow as the number of chunks grow.
Tcross = f(nchunks)
So, per-chunk check time gets smaller as we divide the file system
into more chunks, and at the same time the cross-chunk check time
grows larger. Additionally, the extra disk space taken up by
continuation inodes grows as the number of chunks grows, as does the
overhead of looking up and following continuation inodes during normal
operation. We want to find a sweet spot where the sum of the time to
check a chunk and its cross-references is minimized, while keeping the
runtime overhead of the continuation inodes small.
Does such a sweet spot exist? The answer depends on the layout of
files and directories on disk in real life. If cross-chunk references
are extremely common, then the overhead of continuation inodes will
outweigh the benefits. We came up with an easy way to estimate the
number of cross-chunk references in a chunkfs file system: Take a
"real" in-use ext3 file system and for each file, measure the number
of block groups containing data from that file. Then, for each
directory, measure the number of block groups containing files from
that directory. If we add up the number of block groups less one for
all the files and directories, we'll get the number of cross-chunk
references in a similar chunkfs file system with chunks the size of
the ext3 file system's block groups
(details
here).
Karuna Sagar wrote a tool to measure these cross-references,
called cref, and I added some code to do a worst-case
estimate of the time to check the cross-references (assuming one seek
for every cross-reference).
The results
were encouraging; assuming disk hardware progresses as predicted, the
average cross-chunk cross-reference checking time would be about 5
seconds in 2013, and the worst case would be about 160 seconds (about
2.5 minutes). This is with a 1GB chunk size, so the time to check the
chunk itself would be a few seconds. This estimate is worst-case in
another way: the ext3 allocator is in no way optimized to reduce
cross-chunk references. A chunkfs-style file system would have a more
suitable allocator.
Implementations
Chunkfs was prototyped three times. The first
prototype, written by Amit Gud
for his master's
thesis [PDF] in 2006-2007, implemented chunkfs as modifications
to the ext2 driver in FUSE. Note that the design of chunkfs described
in the thesis is out of date in some respects, see the chunkfs paper [PDF]
for the most recent version. He also ported this
implementation to the kernel in mid-2007, just for kicks. The results
were encouraging. Our main concern was that continuation inodes would
proliferate wildly, overwhelming any benefits. Instead, files with
continuation inodes were uncommon - 2.5% of all files in the file systems -
and no individual file had more than 10 continuation inodes. The test
workload included some simulation of aging by periodically deleting files
while filling the test file system. (It would be interesting to see the
results from using Impressions
[PDF], a very sophisticated tool for generating realistic file system
images.)
These implementations were good first steps, but they were based on an
earlier version of the chunkfs design, before we had solved some
important problems. In these implementations, any chunk that had been
written to since the file system was mounted had to be fscked after a
crash. Given that most of the file system is idle in common usage,
this reduced check time by about 2/3 in the test cases, but we are
looking for a factor of 100, not a factor of 3. It also lacked a
quick way to locate all of the references into a particular damaged
chunk, so it only checked the references leading out of the chunk. It
used an old version of the solution for hard links which would allow
hard links to fail if the target's chunk ran out of space, instead of
growing a continuation for the target into a chunk that did have
space. It was a good first step, but lacked a key feature:
drastically reduced file system repair time.
In mid-2007, I decided to write a prototype of chunkfs as a layered
file system, similar to unionfs or ecryptfs, that was completely
independent of the client file system used in each chunk.
Continuation inodes are implemented as a regular files, using extended
attributes to store the continuation-related metadata (the forward and
back pointers and the offset and length of the file stored in this
continuation inode). When the file data exceeds an arbitrary size
(40K in my prototype), a continuation inode is allocated in another
chunk and the data after that point in the file is stored in that
inode's data. All of the continuation inodes emanating from a
particular chunk are kept in one directory so that they can be quickly
scanned during the cross-chunk pass of fsck.
To test that the prototype could recover from file system corruption
in one chunk without checking the entire file system, I implemented
fsck as a simple shell script. First, it fscks the damaged chunk by
running the ext2 fsck on it. This may end up deleting or moving
arbitrary files in the chunk, which could make it out of sync with the
other chunks. Second, it mounts the now-repaired file systems and
reads their /chunks directories to find all connections
to the damaged chunk and consistency check them. If it finds an
orphaned continuation - a continuation whose origin inode in the
damaged chunk was destroyed or lost - then it deletes that
continuation inode.
The fsck script is deficient in one particular aspect: it checks all
the chunks because I didn't write the code to mark a chunk as damaged.
This is a difficult problem in general; sometimes we know that a chunk
has been damaged because the disk gave us an IO error, or we found an
inconsistency while using the file system, and then we mark the file
system as corrupted and needing fsck. But plenty of corruption is
silent - how can we figure out which chunk was silently corrupted? We
can't, but we can quietly "scrub" chunks by fscking them in the background.
Currently, ext2/3/4 triggers a paranoia fsck every N mounts or M days
since the last check. Unfortunately, this introduces an unexpected
delay of minutes or hours at boot time; if you're lucky, you can go
have a cup of coffee while it finishes, if you're unlucky, you'll be
figuring out how to disable it while everyone who has come to your
talk about improvements in Linux usability watches you. (N.B.: Reboot
and add "fastboot" to the kernel command line.) With chunkfs, we could
run fsck on a few chunks at every boot, adding a few seconds to every
boot but avoiding occasional long delays. We could also fsck inactive
chunks in the background while the file system in use.
Results
I gave a talk at LCA in January 2008 about chunkfs which included a
live demo of
the final
chunkfs prototype in action: creating files, continuing them into
the next chunk, deliberately damaging a chunk, and repairing the file
system. Because I carefully prepared a typescript-based fallback in
case things went sideways, the demo went perfectly. A video
of the talk [Ogg Theora] is available.
Note that I have never been able to bring myself to watch this talk,
so I don't know if it contains any embarrassing mistakes. If you
want, you can follow along during the demo section with
the annotated
output from the demo.
The collection-of-file-systems approach ended up being more complex
than I had hoped. The prototype used ext2 as the per-chunk file
system - a reasonable choice for some throw-away code, but not
workable in production. However, once you switch to any reliable file
system, you end up with one journal per chunk, which is quite a lot of
overhead. In addition, it seemed likely we'd need another journal to
recover from crashes during cross-chunk operations. Another source of
complexity was the use of a layered file system approach - basically,
the chunkfs layer pretends to be a client file system when it's
talking to the VFS, and pretends to be the VFS when it's talking to
the per-chunk file system. Since none of this is officially
supported, we end up with a long list of hacks and workarounds, and
it's easy to run into surprise reference counting or locking bugs.
Overall, the approach worked for a prototype, but it didn't seem worth
the investment it would to make it production quality, especially with
the advent
of btrfs.
Future work
When I began working on chunkfs, the future of Linux file systems
development looked bleak. The chunkfs in-kernel prototype looked like
it might be the only practical way to get repair-driven design into a
Linux file system. All that has changed; file system development is a
popular, well-funded activity and Linux has a promising
next-generation file system, btrfs, which implements many principles
of repair-driven file system design, including checksums, magic
numbers, and redundant metadata. Chunkfs has served its purpose as a
demonstration of the power of the repair-driven design approach and I
have no further development plans for it.
Conclusions
The three chunkfs prototypes and our estimates of cross-chunk
references using real-world file systems showed that the chunkfs
architecture works as advertised. The prototypes also convinced us
that it would be difficult to retrofit existing journaling file
systems to the chunkfs architecture. Features that make file system
check and repair are best when designed into the architecture of the
file system from the beginning. Btrfs is an example of a file system
designed from the ground up with the goal of fast, reliable check and
repair.
Credits
Many people and organizations generously contributed to the design and
prototyping of chunkfs. Arjan van de Ven was the co-inventor of the
original chunkfs architecture. Theodore Ts'o and Zach Brown gave
invaluable advice and criticism while we were thrashing out the
details. The participants of the Hot Topics in Dependability Workshop
gave us valuable feedback and encouragement. Karuna Sagar wrote the
cross-reference measuring tool that gave us the confidence to go
forward with an implementation. Amit Gud wrote two prototypes while a
graduate student at Kansas State University. The development of the
third prototype was funded at various points by Intel, EMC,
and VAH Consulting. Finally,
too many people to list made valuable contributions during discussions
about chunkfs on mailing lists, and I greatly appreciate their help.
Comments (37 posted)
June 17, 2009
This article was contributed by Goldwyn Rodrigues
Errors in the storage subsystem can happen due to a variety of known
or unknown reasons. A filesystem turns read-only when it encounters
errors in the storage subsystem, or a code path which the filesystem
code base should not have taken (i.e. a BUG() path). Making the filesystem
read-only is a safeguard feature that filesystems implement to avoid
further damage because of the errors encountered. Filesystems that change to
read-only end up stalling services relying on data writes and, in some
cases, may lead to an unresponsive system. In embedded devices, dying
applications due to a filesystem turning read-only may render the device
useless, leaving the user confused.
Filesystem errors can either be recoverable or non-recoverable.
Errors from bad block links in the inode data structures or block
pointers can be recovered by filesystem checks. On the other hand, errors due to
memory corruption, or a programming error, might not be recoverable
because one cannot be sure which data is correct.
Denis Karpov proposed a patchset that would notify
user space, so that a user-space policy can be defined to take the
appropriate action to avoid turning the filesystem read-only. The
patchset is currently limited to FAT filesystems. User-space
notifications would allow a filesystem to continue to be used after
encountering "correctable errors" if proactive measures are taken to
correct them. Such corrective
actions can obviate the need for lengthy filesystem checks when the
device is mounted next.
The patchset adds a /sys/fs/fat/<volume>/fs_fault file which is
initialized
to 0 when the filesystem is mounted. The value of fs_fault changes
to 1 on
error. A user-space program can poll() on this file to check if the
value of the file changed which is an indication that an error has occurred.
Besides the file polling, a KOBJ_CHANGE uevent is generated, with
the uevents environment variable FS_UNCLEAN set to 0 or 1. A udev
rule can then
trigger the correct program to take care of the damage done by the
error.
Kay Sievers is not
convinced with the idea of using
uevents for user-space notifications, as uevents are designed for
device notifications, so they do not fit the design goals of reporting
filesystem errors. Filesystem errors are quite often a repeated series
of events. For example, a read failure may result in printing multiple
read errors in dmesg for each block it is not able to read. An event
generated for each block may be too much for udev to handle. Some of
the events may get lost, or worse still, may cause udev to ignore
uevents from other devices which occurred during the burst of
errors.
Uevents have no state, and the information is lost after the event.
Uevents can not block, they need to finish in userspace immediately,
you can not queue them up or anything else, it would block other
operations. Uevents can _never_ be used to transport high frequent
event streams. They might render the entire system unusable, if you
have lots of devices and many errors.
They could be used to get attention when a superblock does a one-time
transition from "clean" to "error", everything else would just get us
into serious trouble later.
Keeping <volume>/fs_fault in sysfs is also not the best solution,
because sysfs is unaware of filesystem namespaces. The primary
responsibility of sysfs is to expose core kernel objects. Filesystem
namespaces are a set of filesystem mounts that are only visible to a
particular process and may be invisible to the rest of the processes.
A process with a private namespace contains a copy of the namespace
instead of a shared namespace. When the system starts, it contains one
global namespace which is shared among all processes. Mounts and
unmounts in a shared namespace are seen by all the processes in the
namespace. A process creates a private namespace if it was created by
the clone() system call using the CLONE_NEWNS flag (clone New
NameSpace). A process sharing a public namespace can also create a private
namespace by calling unshare() with CLONE_NEWNS flag. Mounts and
unmounts within a private namespace are only seen by the process that
created the namespace. A child process created by fork() shares its
parent's namespace.
Because of this, processes might receive errors on a filesystem in a
different namespace, so they would not know which device to act on.
The problem is also noticeable with processes accessing bind
mounts created in a different namespace (bind mounts are a feature in which a
sub-tree of a filesystem can be mounted on another directory).
Moreover, filesystems spanning multiple devices, such as btrfs, would
not be able to report the device name under the current naming
structure.
Kay recommends
/proc/self/mountinfo as a better alternative,
because it contains the list of mount points in the namespace of the
process with the specified PID (self).
Currently, /proc/self/mountinfo changes when the
mount tree changes. This can be extended to propagate errors to
user space in the correct namespace using poll(), regardless of
the device name.
/proc/self/mountinfo can accommodate optional fields which hold values
in the form of "tag[:value]" that can be used to communicate the nature of
the problem. Instead of using the existing udev infrastructure, this would require a dedicated application to monitor
/proc/self/mountinfo, identify the error by parsing the argument,
and act
accordingly.
Jan Kara further suggests
using /proc/self/mountinfo as a link to identify the filesystem device
generating the errors:
What currently seems as the cleanest solution to me, is to add some
"filesystem identifier" field to /proc/self/mountinfo (which could be
UUID, superblock pointer or whatever) and pass this along with the
error message to user-space. Passing could be done either via sysfs (but I
agree it isn't the best fit because a filesystem need not be bound to a
device) or just via generic netlink (which has the disadvantage that you
cannot use the udev framework everyone knows)...
Despite these objections, everyone agrees that error reporting to
user space must not be limited to dmesg messages. User space
should be notified of the errors reported by the filesystem, so that it can
proactively handle errors and try to correct them. The namespace-unaware
/sys filesystem or notifications through uevent may not be the best
solution, but, for a lack of a better alternative interface,
the developers used sysfs and uevents. The design is still under
discussion, and will take some time to evolve, though it is likely
that some kind of solution to this problem will make its way into the kernel.
Comments (3 posted)
June 12, 2009
This article was contributed by Neil Brown
Last week we discussed the
value of enunciating kernel design patterns
and looked at the design patterns surrounding reference counts.
This week we will look at a very different aspect of coding and see
why the kernel has special needs, and how those needs have been
addressed by successful approaches. The topic under the microscope
today is complex data structures.
By "complex data structures" we mean objects that are composed of a
variable number of simpler objects, possibly a homogeneous
collection, possibly a heterogeneous collection. In general it is a
structure to which objects can be added, from which objects can be
removed, and in which objects can be found.
The preferred way to work with such data structures when we study them in an
introductory programming course is to use Abstract Data
Types.
Abstract Data Types
The idea behind Abstract Data Types is to encapsulate the entire
implementation of a data structure, and provide just a well defined
interface for manipulating it.
The benefit of this approach is that it provides a clean separation.
The data type can be implemented with no knowledge of the application
which might end up using it, and the application can be implemented
with no knowledge of the implementation details of the data type.
Both sides simply write code based on the interface which works like a
contract to explicitly state what is required and what can be
expected.
On the other hand,
one of the costs of this approach is tightly connected with the
abstract nature of the interface. The point of abstracting an
interface is to remove unnecessary details. This is good for
introductory computing students, but is bad for kernel programmers.
In kernel programming. performance is very important, coming as a
close third after correctness and maintainability, and sometimes
taking precedence over maintainability. Not all code paths in the
kernel are performance-critical, but many are, and the development
process benefits from the same data structures being using in both
performance critical and less critical paths. So it is essential that
data types are not overly abstracted, but that all details of the
implementation are visible so that the programmer can make optimal choices
when using them.
So the first principle of data structures in the kernel is not to hide
detail. To see how this applies, and to discover further principles
from which to extract patterns, we will explore a few of the more
successful data types used in the Linux kernel.
Linked Lists
Starting simply, the first data type we will explore are doubly linked
lists. These are implemented by a single include file,
<linux/list.h>.
There is no separate ".c" file with any library of support routines.
All of the code for handling linked lists is simple enough to be
implemented using inline functions. Thus it is very easy to use this
implementation in any other (GPLv2-licensed) project.
There are two aspects of the "list.h" lists which are worth noting as
they point to possible patterns.
The first is struct list_head, which serves not only as the head of a
list, but also as the anchor in items that are on a list.
Your author has seen other linked list implementations which required
that the first and second element in any data structures meant to be stored
in lists be the "next" and
"previous" pointers, so that common list-walking code could be used on a variety
of different data structures. Linux kernel lists do not suffer from this
restriction. The list_head structure can be embedded anywhere in a
data structure, and the list_heads from a number of instances of
that structure can be linked together. The containing structure can be
found from a ->next or ->prev pointer using
the list_entry() macro.
There are at least two benefits of this approach. One is that
the programmer still has full control of placement of fields in the
structure in case they need to put important fields close together to
improve cache utilization. The other is that a structure can easily
be on two or more lists quite independently, simply by having multiple
struct list_head fields.
This practice of embedding one structure inside another and using
container_of() (which is the general form of list_entry()) to get the
parent from the child structure is quite common in the Linux kernel
and is somewhat reminiscent of object oriented programming. The
container is like a subtype of the embedded structure.
The other noteworthy aspect of list.h is the proliferation of
"for_each" macros - the macros that make it easy to walk along a list
looking at each item in turn. There are 20 of them (and that isn't
counting the four more in rculist.h which I'll choose to ignore in the
hope of brevity).
There are a few different reasons for this. The simplest are that
- We sometimes want to walk the list in the "reverse"
direction (following the "prev" link). There are five macros that go
backward, and 15 that go forward.
- We sometimes want to start in the middle of a list and
"continue" on from there, so we have four "continue" macros and
three "from" macros which interpret that starting point slightly
differently.
- We sometimes want to work with the struct list_head embedded in
the target structure, but often we really want to use the
list_entry() macro to get the enclosing structure; we find it
easiest if the "for_each" macro does that for us. This provides
the "entry" versions of the "for_each" macro, of which there are
13 (more than half).
Getting to the more subtle reasons, we sometimes want to be able to
delete the "current" item without upsetting the walk through the
list. This requires that a copy of the "next" pointer be taken before
providing "this" entry to be acted upon, thus yielding the eight "safe"
macros. An "ADT" style implementation of linked lists would quite
likely only provide "safe" versions so as to hide these details.
However kernel programmers don't want to waste the storage or effort
for that extra step in the common case were it isn't needed.
Then there is the fact that we actually have two subtly different
types of linked lists. Regular linked lists use struct list_head as
the head of the list. This structure contains a pointer to the start and to the
end. In some use cases, finding the end of the list is not needed,
and being able to halve the size of the head of the list is very
valuable. One typical use case of that kind is a hash table where all these
heads need to go in an array. To meet this need, we have the hlist,
which is very similar to the regular list, except that only one
pointer is needed in struct hlist_head. This accounts for six of the
different "for_each" macros.
If we had every possibly combination of forward or reverse, continue
or not, entry or not, safe or not, and regular or hlist, we would have
32 different macros. In fact, only 19 of these appear to have been
needed and, thus, coded. We certainly could code the remaining eleven, but as
having code that is never used tends to be frowned upon, it hasn't
happened.
The observant reader will have noticed a small discrepancy in some of
the above numbers. Of the 20 macros, there is one that doesn't fit
the above patterns, and it drives home the point that was made earlier
about kernel programmers valuing performance.
This final "for_each" macro is __list_for_each(). All of the other
macros use the "prefetch" function to suggest that the CPU starts
fetching the ->next pointer at the start of each iteration so that
it will already be available in cache when the next iteration starts
(though the "safe" macros actually fetch it rather than prefetch it).
While this will normally improve performance, there are cases when it
will slow things down. When the walk of the list will almost always
abort very early - usually only considering the first item - the
prefetch will often be wasted effort. In these cases (currently all
in the networking code) the __list_for_each() macro is available. It
does not prefetch anything. Thus people having very strict performance
goals can have a better chance of getting the performance they want.
So from this simple data structure we can see two valuable patterns
that are worth following.
- Embedded Anchor: A good way to include generic objects in a data
structure is to embed an anchor in them and build the data structure
around the anchors. The object can be found from the anchor using
container_of().
- Broad Interfaces:
Don't fall for the trap of thinking that "one size fits all".
While having 20 or more macros that all do much the same thing is
uncommon, it can be a very appropriate way of dealing with the
complexity of finding the optimal solution. Trying to squeeze
all possibilities into one narrow interface can be inefficient and
choosing not to provide for all possibilities is counter-productive.
Having all the permutations available encourages developers to use
the right tool for the job and not to compromise.
In 2.6.30-rc4, there are nearly 3000 uses of list_for_each_entry(),
about 1000 of list_for_each_entry_safe(), nearly 500 of
list_for_each(), and less than 1000 of all the rest put together.
The fact that some are used rarely in no way reduces their importance.
RB-trees
Our next data structure is the RB-Tree or red-black tree.
This is a semi-balanced, binary search tree that generally provides
order "log(n)" search, insert, and delete operations. It is implemented in
<linux/rbtree.h> and lib/rbtree.c.
It has strong similarities to the list.h lists in that it embeds an
anchor (struct rb_node) in each data structure and builds the tree
from those.
The interesting thing to note about rbtree is that there is no search
function. Searching an rbtree is really a very simple operation and
can be implemented in just a few lines as shown by the examples at the
top of rbtree.h.
A search function certainly could be written, but the developer chose
not to. The main reason, which should come as no surprise, is
performance.
To write a search function, you need to pass the "compare" function
into that search function. To do that in C, you would need to pass a
function pointer. As compare functions are often very simple, the
cost of following the function pointer and making the function call
would often swamp the cost of doing the comparison itself.
It turns out that having the whole search operation compiled as one
function makes for more efficient code.
The same performance could possibly be achieved using inline functions
or macros, but given that the search function itself is so short, it
hardly seems worthwhile.
Note also that rbtree doesn't exactly provide an insert function
either. Rather, the developer needs to code a search; if the search
fails, the new node must be inserted at the point where it was found
not to exist and the tree must be rebalanced. There are
functions for this final insertion and rebalancing as they are
certainly complex enough to deserve separate functions.
By giving the developer the responsibility for search and for some of
insert, the rbtree library actually is giving a lot of valuable
freedom. The pattern of "search for an entry but if it isn't there,
insert one" is fairly common. However the details of what happens
between the "search" and "add" phases is not always the same and so
not something that can easily be encoded in a library. By providing
the basic tools and leaving the details up to the specific situation,
users of rbtree find themselves liberated, rather than finding
themselves fighting with a library that almost-but-not-quite does what
they want.
So this example of rbtrees re-enforces the "embedded anchors" pattern
and suggests a pattern that providing tools is sometimes much more
useful than providing complete solutions. In this case, the base data
structures and the tools required for insert, remove, and rebalance are
provided, but the complete solution can still be tailored to suit each
case.
This pattern also describes the kernel's approach to hash tables.
These are a very common data structure, but there is nothing that
looks even vaguely like a definitive implementation. Rather the basic
building blocks of the hlist and the array are available along with
some generic functions for calculating a hash (<linux/hash.h>).
Connecting these to fit a given purpose is up to the developer.
So we have another pattern:
- Tool Box:
Sometimes it is best not to provide a complete solution for a
generic service, but rather to provide a suite of tools that can be
used to build custom solutions.
Radix tree
Our last data structure is the Radix tree.
The Linux kernel actually has two radix tree implementations.
One is in <linux/idr.h> and lib/idr.c, the other in
<linux/radix-tree.h> and lib/radix-tree.c.
Both provide a mapping from a number (unsigned long) to an arbitrary
pointer (void *), though radix-tree also allows up to two "tags" to be
stored with each entry. For the purposes of this article we will only
be looking at one of the implementations (the one your author is most
familiar with) - the radix-tree implementation.
Radix-tree follows the pattern we saw in list.h of having multiple
interfaces rather than trying to pack lots of different needs into the
one interface. list.h has 20 "for_each" macros; radix-tree has six
"lookup" functions, depending on whether we want just one
item or a range (gang lookups), or whether we want to use the tags
to restrict the search (tag lookups) and whether we want to find the
place where the pointer is stored, rather than the pointer that is
stored there (this is needed for the subtle locking strategies of the
page cache).
However radix-tree does not follow the embedded anchor pattern of the
earlier data structures, and that is why it is interesting.
For lists and rbtree, the storage needed for managing the data
structure is exactly proportional to the number of items in the
data structure on a one-to-one basis, so keeping this storage in those
item works perfectly. For a radix-tree, the storage needed is a
number of little arrays, each of which refers to a number of
items. So embedding these arrays, one each, in the items cannot
work.
This means that, unlike list.h and rbtree, radix-tree will sometimes
need to allocate some memory in order to be able to add items to the
data structure. This has some interesting consequences.
In the previous data structures (lists and rbtrees), we made no mention
of locking. If locking is needed, then the user of the data
structure is likely to know the specific needs so all locking details are
left up to the caller (we call that "caller locking" as opposed to
"callee locking". Caller locking is more common and generally
preferred). This is fine for lists and rbtrees as nothing that they
do internally is affected particularly by locking.
This is not the
case if memory allocation is needed, though.
If a process needs to allocate memory, it is possible that it will
need to sleep while the memory management subsystem writes data out to
storage to make memory
available. There are various locks (such as spinlocks) that may not
be held while a process sleeps. So there is the possibility for
significant interaction between the need to allocate memory internally
to the radix-tree code, and the need to hold locks outside the
radix-tree code.
The obvious solution to this problem (once the problem is understood) is to
preallocate the maximum amount of memory needed before taking the
lock. This is implemented within radix-tree with the
radix_tree_preload() function. It manages a per-CPU pool of available
radix_tree nodes and makes sure the pool is full and will not be used
by any other radix-tree operation. Thus, bracketing radix_tree_insert()
with calls to radix_tree_preload() and
radix_tree_preload_end() ensures that the
radix_tree_insert() call will not fail due to lack of memory (though the
radix_tree_preload() might) and that there will be no unpleasant
interactions with locking.
Summary
So we can now summarize our list of design patterns that we have found
that work well with data structures (and elsewhere) in the kernel.
Those that have already been detailed are briefly included in this list
too for completeness.
- Embedded Anchor. This is very useful for lists, and can be
generalized as can be seen if you explore kobjects (an exercise
left to the reader).
- Broad Interfaces. This reminds us that trying to squeeze lots of
use-cases in to one function call is not necessary - just
provide lots of function calls (with helpful and (hopefully)
consistent names).
- Tool Box.
Sometimes it is best not to provide a complete solution for a
generic service, but rather to provide a suite of tools that can be
used to build custom solutions.
- Caller Locks. When there is any doubt, choose to have the caller
take locks rather than the callee. This puts more control in
that hands of the client of a function.
- Preallocate Outside Locks. This is in some ways fairly obvious.
But it is very widely used within the kernel, so stating it
explicitly is a good idea.
Next week we will complete
our investigation of kernel design patterns
by taking a higher level view and look at some patterns relating to
the design of whole subsystems.
Exercises
For those who would like to explore these ideas further, here are some
starting points.
-
Make a list of all data structures that are embedded, by exploring
all uses of the "container_of" macro. Of these, make a list of
pairs that are both embedded in the same structure (modeling multiple
inheritance). Comment on how this reflects on the general
usefulness of multiple inheritance.
-
Write a implementation of skiplists
that would be suitable for in-kernel use. Consider the
applicability of each of the patterns discussed in this article.
Extra credit for leveraging list.h lists.
-
Linux contains a mempool library which radix-tree chooses not to
use, preferring it's own simple pool (in radix_tree_preload).
Examine the consequences of changing radix-tree to use mempool, and
of changing mempool to be usable by radix-tree.
Provide patches to revert this design choice in radix-tree, or a
pattern to explain this design choice.
-
Compare the radix-tree and idr implementations to see if one could
be implemented using the other without sacrificing correctness,
maintainability or performance. Provide either an explanation of why
they should stay separate, or a patch to replace one with the other.
Comments (18 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Virtualization and containers
Miscellaneous
Page editor: Jake Edge
Next page: Distributions>>