Kernel development
Brief items
Kernel release status
The current stable 2.6 kernel remains 2.6.30. The 2.6.31 merge window is open (see below).
The current stable 2.6 kernel update is 2.6.29.5, released on June 15. 2.6.27.25 was released on June 11. Both have a long list of fixes throughout the kernel.
Kernel development news
2.6.31 merge window, week 1
The process of merging patches for 2.6.31 has begun. As of this writing, 6220 non-merge changesets have been added to the mainline. Some of the more interesting, user-visible changes include:
- Reads from /dev/zero can now be interrupted by signals. In
theory, no application should be adversely affected by this change,
but it is still a true user-space ABI change.
- The x86 architecture can now handle 46-bit physical addresses,
allowing the use of up to 64TB of physical memory.
- A number of Xen features meant to support Dom0 functionality
(the /dev/xen/evtchn event channel code and
/sys/hypervisor, for example) have been merged. The Dom0
code itself remains outside, though.
- Support for the rt_tgsigqueueinfo() system call, which sends
a signal (with ancillary information) to a thread group, has been added.
- A number of ftrace features have been added. These include a function
profiler, a number of event tracer improvements, some new tracepoints,
and a new docbook document describing static tracepoints in the kernel.
- The USB TTY port driver has been reworked in ways which bring it into
closer alignment with POSIX behavior and that of other types of serial
ports. Alan Cox fears that some applications which depended on the
old behavior might break - though others, which had problems with USB
serial ports, should now begin to work.
- There is a new sysctl knob
(/proc/sys/kernel/modules_disabled); writing "1" to that file
will cause module loading to be forevermore disallowed.
- The SMACK security module now has a comprehensive logging facility.
- The splice() system call now works when both the input and
output files are pipes.
- The storage topology patches (covered briefly in April)
have been merged. This code allows the kernel to export much more
information about how storage devices are structured, enabling support
for upcoming hardware.
- The performance counters patch set (last covered in December)
has been merged. This code provides a new API for the use of
hardware performance counters; it edges out perfmon and a number of
other implementations in this area.
- The character devices in
user space (CUSE) patch set has been merged.
- arch-imx has been removed from the ARM architecture
as it has been
superseded by the MXC architecture.
- The s390 architecture has added support for dynamic ftrace, as well as the
system call and function graph tracers.
- The packet_mmap changes for the transmission side of packet sockets were
merged, which allows for
more efficient, nearly zero-copy, packet sends from user space.
- The controller area network (CAN) subsystem has added a network device
driver interface and a netlink interface. New CAN device drivers have also
been
merged using the driver interface (see below).
- Support for IEEE 802.15.4 sockets has been added to the network subsystem.
This is for low-cost, low-speed "personal area networks".
- Passive OS fingerprinting has been added to
the netfilter code.
- The FAT filesystem has added an "errors" mount option which governs its
behavior in the presence of critical errors.
- The s390 architecture has added support for hibernation.
- Support has been added for USB 3.0/xHCI host controllers, though none are yet
available.
- Kernel modesetting for the radeon driver, supporting R1XX, R2XX, R3XX,
R4XX, R5XX, and radeon up to X1950 hardware, has been merged.
- There is the usual pile of new drivers:
- Architectures/processors/systems: SuperH SH7724 processors,
Hitachi SH7724 (SolutionEngine7724) boards,
memory error detection and correction (EDAC) support for AMD K8,
F10h, and F11h processors, ARM PCM043 boards, ARM Atmark
Armadillo 500 boards, OMAP3 Zoom2 boards, OMAP4430 SDP boards
(including SMP support), ARM
MX35PDK boards, Marvell 88F6281 GTW GE boards, Samsung S3C6400 SoCs,
ARMv6/v7 big-endian, Stargate 2 boards, Freescale STMP platforms,
ST-Ericsson U300 series platforms, PowerPC MPC8569MDS boards, PowerPC
P2020DS boards.
- Miscellaneous: Timberdale FPGA UARTs, 64-bit VIA
hardware random number generators, Mediama RMTx add-on board for
ATNGW100, Wacom Graphire Bluetooth tablet, CB710/720 memory card
readers, Maxim 1586 voltage regulators, TI TMP401 temperature sensor,
Intel Langwell USB device controllers, Intel Langwell USB On-the-go
controllers, TI VLYNQ bus, ST Microelectronics DDC I2C interface.
- Networking: Broadcom NetXtremeII gigabit Ethernet cards
(offload features in particular), Intellon PLC (Powerline
communications) USB adapters, Marvell SD8688 wireless chips, Ralink
rt2800 wireless USB chipsets, TI wl1251/wl1271 wireless chipsets, TI
DaVinci Ethernet MACs, Phillips SJA1000 CAN
controllers, Kvaser PCIcanx and Kvaser PCIcan PCI CAN Cards, Intel
wireless Multicomm 3200 devices, Micrel KS8842 Ethernet switches.
- Sound: Wolfson Micro WM8988, WM8940, WM9081,
and WM8960 codecs,
Digigram LX6464ES boards,
ESI Maya44 boards,
several Creative Sound Blaster X-FI devices based on the 20K1 and
20K2 chipsets, a USB Audio gadget driver.
- Graphics: AMD r600 chipset, Intel IGDNG chipset.
- Video: DVB-S/S2/DSS Multistandard Professional/Broadcast demodulators, STV6110/(A) based tuners, ISL6423 SEC controllers, TI THS7303 video amplifiers, Analog Devices I2C bus based ADV7343 encoders.
Changes visible to kernel developers include:
- There is a new atomic function:
int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);

This function will decrement cnt and, if cnt reaches zero, it will acquire the given lock (a usage sketch appears after this list).
- A number of block layer request
queue API changes have been merged; all drivers must now dequeue
requests before executing them. Beyond that, the merging of the
storage topology patches (in preparation for 4K-sector disks) means
that block drivers must now distinguish between the physical block
size on the disk and the logical block size used by the kernel.
- The 32-bit x86 architecture now supports the atomic64_t
type.
- The kernel memory leak
detector has been merged at last.
- The fsnotify backend has been merged. This code provides a new,
common implementation for dnotify and inotify; it also will serve as
the base for the "fanotify" code (formerly TALPA), which has not been
merged as of this writing.
- Btrfs has seen a number of enhancements, including one which involves
an on-disk format change. Existing btrfs filesystems will work with
the new code, but, once mounted on a 2.6.31 system, those filesystems
will no longer work with older kernels.
- Tree read-copy update (RCU) is now the
default, though Classic RCU is still available.
- Changes to the include/asm-generic header files were
merged. These changes are meant to serve as a model for, or be used
directly by, new architectures rather than having each new architecture
copy from an existing one. The S+core (score) architecture
depends on these changes, and the MicroBlaze architecture will be using them
to clean up its ABI.
- Some rather large cleanups for XFS were merged, including switching to the
generic POSIX ACL code and removal of the xfs_qmops quota ops
function vector.
- All network drivers have been converted to the new net_device_ops API,
and the old API available with COMPAT_NET_DEV_OPS has been removed.
- The rfkill core has been rewritten for devices that implement a way to stop
all radio transmission from the device (in response to a laptop key for
turning off wireless, for example). Various drivers have also been updated
to use the new rfkill API.
- Debugfs has had all of its references throughout the tree turned into
/sys/kernel/debug/ in both documentation and code. In addition,
LWN's updated guide to debugfs was added to
the Documentation directory.
- Unicode handling in the kernel has been updated, with functions like
utf_mbstowcs() being renamed to utf8s_to_utf16s() for
better readability.
- The kmemcheck kernel memory checker, which detects the use of
uninitialized memory, has been merged.
- The TTM GPU memory manager (covered a bit over a year ago) has been merged.
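As an aside, the new atomic_dec_and_mutex_lock() mentioned in the list above targets the classic "drop the last reference, then remove the object under a lock" pattern. Here is a minimal sketch of that usage; the object, list, and lock names are invented for illustration:

```c
#include <asm/atomic.h>		/* <linux/atomic.h> in later kernels */
#include <linux/mutex.h>
#include <linux/list.h>
#include <linux/slab.h>

struct my_obj {				/* hypothetical object */
	atomic_t refcnt;
	struct list_head node;		/* linkage on a global list */
};

static DEFINE_MUTEX(obj_list_lock);	/* protects the global list */

static void my_obj_put(struct my_obj *obj)
{
	/*
	 * Returns nonzero, with obj_list_lock held, only when the
	 * counter actually reached zero; otherwise the lock is never
	 * taken.  This closes the race between the final decrement and
	 * another CPU finding the object on the list.
	 */
	if (atomic_dec_and_mutex_lock(&obj->refcnt, &obj_list_lock)) {
		list_del(&obj->node);
		mutex_unlock(&obj_list_lock);
		kfree(obj);
	}
}
```

The attraction is that the mutex is only taken on the final decrement, so the common-case put never touches the lock.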
Linus started merging patches on June 10, suggesting that the merge window will remain open until sometime around June 24. That puts us roughly halfway through the merge window, in terms of time. The merge rate will likely slow down some for the remainder of the merge window, but there are, undoubtedly, more interesting changes to come. Stay tuned.
What ever happened to chunkfs?
"What ever happened to chunkfs?" This is a question I hear every few months, and I always answer the same way, "Chunkfs works, the overhead is reasonable, and it is only practical if it is part of the file system design from the beginning, not tacked on after the fact. I just need to write up the paper summarizing all the data." Thanks to your benevolent LWN editor, you are now reading that paper.Background
Before we describe chunkfs, we should first talk about the problems that led to its creation. Chunkfs was motivated by the growing difficulty of reading and processing all the data necessary to check an entire file system. As the capacity of storage grows, the capacity of the rest of the system to read and check that data is not keeping pace with that growth. As a result, the time to check a "normal" file system grows, rather than staying constant as it would if the memory, bandwidth, seek time, etc. of the system grew in proportion to its storage capacity. These differential rates of growth in hardware - like the differences between RAM capacity, bandwidth, and latency - are part of what keeps systems research interesting.
To understand the change in time to check and repair a file system, a useful analogy is to compare the growth of my library (yes, some people still own books) with the growth of my ability to read, organize, and find the books in the library. As a child, my books all fit on one shelf. I could find the book I wanted in a matter of seconds and read every single one of my books in a week. As an adult, my books take up several bookshelves (in addition to collecting on any flat surface). I can read many times faster than I could when I was a kid, and organize my books better, but finding a book can take several minutes, and reading them all would take me a year or more. If I were trying to find a twenty dollar bill I left in a book, it would take me several hours to riffle through all the books to find it. Even though I am a better reader now, my library grew faster than my ability to read or search it. Similarly, computers are faster than ever, but storage capacity grew even faster.
The consequence of this phenomenon that first caught my attention was the enormous disparity between seek time and capacity of disks. I calculated a projected change in fsck time given these predictions [PDF] from a storage industry giant in 2006:
| Disk performance metric | Projected improvement, 2006-2013 |
|---|---|
| Capacity | 16.0x |
| Bandwidth | 5.0x |
| Seek time | 1.2x |

=> at least a 10x increase in fsck time!
Since then, commodity flash-based storage finally became a practical reality. Solid-state disks (SSDs) have much faster "seeks", somewhat improved bandwidth, and drastically reduced capacity compared to disks, so the performance of fsck ought to be extremely good. (I am unable to find any measurements of fsck time on an SSD, but I would estimate it to be on the order of seconds. Readers?) However, SSDs don't solve all our problems. First, SSDs are an element in the cache hierarchy of a system, layered between system RAM and disk, not a complete replacement for disks. The sweet spot for SSDs will continue to expand, but it will be many years or decades before disks are completely eliminated - look at the history of tape. It's easy for a laptop user to wave their hands and say, "Disks are obsolete," but try telling that to Google, your mail server sysadmin, or even your MythTV box.
Second, remember that the whole reason we care about fsck in the first place is because file systems get corrupted. One important source of file system corruption is failures in the storage hardware itself. Ask any SSD manufacturer about the frequency of corruption on their hardware and they will inundate you with equations, lab measurements, and simulations showing that the flash in their SSDs will never wear out in "normal" use - but they also won't reveal the details of their wear-leveling algorithms or the workloads they used to make their predictions. As a result, we get surprises in real-world use, both in performance and in reliability. When it comes to failure rates, we don't have good statistics yet, so I have to fall back on my experience and that of my kernel hacker friends. What we see in practice is that SSDs corrupt more often and more quickly than disks, and prefer to devour the tastiest, most crucial parts of the file system. You can trust the hardware manufacturers when they wave their hands and tell you not to worry your pretty little head, but I put more weight on the money I've made consulting for people with corrupted SSDs.
So, if corruption is a concern, an important benefit of the chunkfs approach is that file system corruption is usually limited to a small part of the file system, regardless of the source of the corruption. Several other repair-driven file system principles - such as duplicating and checksumming important data - make recovery of data after corruption much more likely. So if you don't care about fsck time because you'll be using an SSD for the rest of your life, you might still read on because you care about getting your data back from your SSD.
The conclusion I came to is that file systems should be designed with fast, reliable check and repair as a goal, from the ground up. I outline some useful methods for achieving this goal in a short paper: Repair-driven File System Design [PDF].
Chunkfs design
Chunkfs is the unmusical name for a file system architecture designed under the assumption that the file system will be corrupted at some point. Therefore the on-disk format is optimized not only for run-time performance but also for fast, reliable file system check and repair. The basic concept behind chunkfs is that a single logical file system is constructed out of multiple individual file systems (chunks), each of which can be checked and repaired (fscked) individually.
This is great and all, but now we have a hundred little file systems, with the namespace and disk space fragmented amongst them - hardly an improvement. For example, if we have a 100GB file system divided into 100 1GB chunks, and we want to create a 2GB file, it won't fit in a single chunk - it has to somehow be shared across multiple chunks. So how do we glue all the chunks back together again so we can share the namespace and disk space while preserving the ability to check and repair each chunk individually? What we want is a way to connect a file or directory in one chunk and another file or directory in another chunk in such a way that the connection can be quickly and easily checked and repaired without a full fsck of the rest of the chunk. The solution is something we named a "continuation inode". A continuation inode "continues" a file or directory into another chunk.
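To make the idea concrete, a continuation inode can be pictured as an inode carrying a little extra metadata that ties two chunks together. The structures below are purely illustrative - they are not chunkfs's actual on-disk format - but they show the forward and back pointers, and the (offset, length) bookkeeping, discussed in this article:

```c
#include <linux/types.h>

/* Illustrative only - not chunkfs's real on-disk format. */
struct cont_ref {			/* hypothetical cross-chunk pointer */
	__u32	chunk;			/* chunk the partner inode lives in */
	__u64	ino;			/* inode number within that chunk */
};

struct orig_inode {			/* hypothetical: the original inode */
	struct cont_ref	cont;		/* forward pointer to the continuation */
	/* ... ordinary inode fields for the first piece of the file ... */
};

struct cont_inode {			/* hypothetical: the continuation */
	struct cont_ref	back;		/* back pointer to the original inode */
	__u64		offset;		/* where in the file this piece starts */
	__u64		length;		/* how much of the file it holds */
	/* ... ordinary inode fields for the local data ... */
};
```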
You'll notice that there are two pointers in the sketch above, one pointing from the original inode to the continuation inode, and another pointing back from the continuation inode to the original inode. When checking the second chunk, you can check the validity of the continuation inode quickly, by using the back pointer to look up the original inode in its chunk. This is a check of a "cross-chunk reference" - any file system metadata that makes a connection between data in two different chunks. Cross-chunk references must always have forward and back pointers so that they can be verified starting from either chunk.
Cross-chunk references must satisfy one other requirement: You must be able to quickly find all cross-references from chunk A to arbitrary chunk B, without searching or checking the entire chunk A. To see why this must be, we'll look at an example. Chunk A has an inode X which is continued into chunk B. What if chunk A was corrupted in such a manner that inode X was lost entirely? We would finish checking and repairing chunk A, and it would be internally self-consistent, but chunk B would still have a continuation inode for X that is now impossible to look up or delete - an orphan continuation inode. So as the second step in checking chunk A, we must then find all references to chunk A from chunk B (and all the other chunks) and check that they are in agreement. If we couldn't, we would have to search every other chunk in the file system to check a single chunk - and that's not quick.
A chunkfs-style file system can be checked and repaired incrementally and online, whereas most file systems must be checked all at once while the file system is offline. Some exceptions to this general rule do exist, such as the BSD FFS snapshot-based online check, and an online self-healing mechanism for NTFS, but in general these facilities are hacked on after the fact and are severely limited in scope. For example, if the BSD online check actually finds a problem, the file system must be unmounted and repaired offline, all at once, in the usual way, including a rerun of the checking stage of fsck.
The chunkfs design requires solutions to several knotty problems we don't have room to cover in this article, but you can read about in our 2006 Hot Topics in Dependability paper: Chunkfs: Using divide-and-conquer to improve file system reliability and repair [PDF].
Measurements
Repair-driven file system design, chunkfs in particular, sounds like a neat idea, but as a file system developer, I've had a lot of neat ideas that turned out to be impossible to implement, or had too much overhead to be practical. The questions I needed to answer about chunkfs were: Can it be done? If it can be done, will the overhead of continuation inodes outweigh the benefit? In particular, we need to balance the time it takes to check an individual chunk with the time it takes to check its connections to other chunks (making sure that the forward and back pointers of the continuation inodes agree with their partners in the other chunk). First, let's take a closer look at file system check time on chunkfs.
The time to check and repair a file system with one damaged chunk is the sum of two different components, the time to check one chunk internally and the time to check the cross references to and from this chunk.
T_fs = T_chunk + T_cross
The time to check one chunk is a function of the size of the chunk, and the size of the chunk is the total size of the file system divided by the number of chunks.
T_chunk = f(size_chunk)
size_chunk = size_fs / n_chunks
The time to check the cross-chunk references to and from this chunk depends on the number of those cross-references. The exact number of cross-chunk references will vary, but in general larger chunks will have fewer cross-references and smaller chunks will have more - that is, cross-chunk check time will grow as the number of chunks grows.
T_cross = f(n_chunks)
So, per-chunk check time gets smaller as we divide the file system into more chunks, and at the same time the cross-chunk check time grows larger. Additionally, the extra disk space taken up by continuation inodes grows as the number of chunks grows, as does the overhead of looking up and following continuation inodes during normal operation. We want to find a sweet spot where the sum of the time to check a chunk and its cross-references is minimized, while keeping the runtime overhead of the continuation inodes small.
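To see how these opposing costs trade off, here is a toy model with made-up constants - a linear per-chunk check cost and a linear cross-reference cost, both rough approximations at best - that scans for the chunk count minimizing total check time:

```c
#include <stdio.h>

#define FS_SIZE_GB	100.0	/* whole file system */
#define SECS_PER_GB	10.0	/* assumed per-chunk fsck rate */
#define SECS_PER_CHUNK	0.5	/* assumed cross-reference cost per chunk */

int main(void)
{
	int n, best_n = 0;
	double best = 1e12;

	for (n = 1; n <= 1000; n++) {
		double t_chunk = (FS_SIZE_GB / n) * SECS_PER_GB;
		double t_cross = n * SECS_PER_CHUNK;

		if (t_chunk + t_cross < best) {
			best = t_chunk + t_cross;
			best_n = n;
		}
	}
	printf("minimum T_fs = %.1fs at %d chunks\n", best, best_n);
	return 0;
}
```

With these invented constants the model bottoms out at about 45 seconds of total check time around 45 chunks.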
Does such a sweet spot exist? The answer depends on the layout of files and directories on disk in real life. If cross-chunk references are extremely common, then the overhead of continuation inodes will outweigh the benefits. We came up with an easy way to estimate the number of cross-chunk references in a chunkfs file system: Take a "real" in-use ext3 file system and for each file, measure the number of block groups containing data from that file. Then, for each directory, measure the number of block groups containing files from that directory. If we add up the number of block groups less one for all the files and directories, we'll get the number of cross-chunk references in a similar chunkfs file system with chunks the size of the ext3 file system's block groups (details here).
Karuna Sagar wrote a tool to measure these cross-references, called cref, and I added some code to do a worst-case estimate of the time to check the cross-references (assuming one seek for every cross-reference). The results were encouraging; assuming disk hardware progresses as predicted, the average cross-reference checking time would be about 5 seconds in 2013, and the worst case would be about 160 seconds (just under three minutes). This is with a 1GB chunk size, so the time to check the chunk itself would be a few seconds. This estimate is worst-case in another way: the ext3 allocator is in no way optimized to reduce cross-chunk references. A chunkfs-style file system would have a more suitable allocator.
Implementations
Chunkfs was prototyped three times. The first prototype, written by Amit Gud for his master's thesis [PDF] in 2006-2007, implemented chunkfs as modifications to the ext2 driver in FUSE. Note that the design of chunkfs described in the thesis is out of date in some respects; see the chunkfs paper [PDF] for the most recent version. He also ported this implementation to the kernel in mid-2007, just for kicks. The results were encouraging. Our main concern was that continuation inodes would proliferate wildly, overwhelming any benefits. Instead, files with continuation inodes were uncommon - 2.5% of all files in the file systems - and no individual file had more than 10 continuation inodes. The test workload included some simulation of aging by periodically deleting files while filling the test file system. (It would be interesting to see the results from using Impressions [PDF], a very sophisticated tool for generating realistic file system images.)
These implementations were good first steps, but they were based on an earlier version of the chunkfs design, before we had solved some important problems. In these implementations, any chunk that had been written to since the file system was mounted had to be fscked after a crash. Given that most of the file system is idle in common usage, this reduced check time by about 2/3 in the test cases, but we are looking for a factor of 100, not a factor of 3. The prototypes also lacked a quick way to locate all of the references into a particular damaged chunk, so they only checked the references leading out of the chunk. They used an old version of the solution for hard links, which would allow hard links to fail if the target's chunk ran out of space, instead of growing a continuation for the target into a chunk that did have space. In short: good first steps, but missing the key feature of drastically reduced file system repair time.
In mid-2007, I decided to write a prototype of chunkfs as a layered file system, similar to unionfs or ecryptfs, that was completely independent of the client file system used in each chunk. Continuation inodes are implemented as regular files, using extended attributes to store the continuation-related metadata (the forward and back pointers and the offset and length of the file stored in this continuation inode). When the file data exceeds an arbitrary size (40K in my prototype), a continuation inode is allocated in another chunk and the data after that point in the file is stored in that inode's data. All of the continuation inodes emanating from a particular chunk are kept in one directory so that they can be quickly scanned during the cross-chunk pass of fsck.
To test that the prototype could recover from file system corruption in one chunk without checking the entire file system, I implemented fsck as a simple shell script. First, it fscks the damaged chunk by running the ext2 fsck on it. This may end up deleting or moving arbitrary files in the chunk, which could make it out of sync with the other chunks. Second, it mounts the now-repaired file systems and reads their /chunks directories to find all connections to the damaged chunk, checking each for consistency. If it finds an orphaned continuation - a continuation whose origin inode in the damaged chunk was destroyed or lost - then it deletes that continuation inode.
The fsck script is deficient in one particular aspect: it checks all the chunks, because I didn't write the code to mark a chunk as damaged. This is a difficult problem in general; sometimes we know that a chunk has been damaged because the disk gave us an IO error, or we found an inconsistency while using the file system, and then we mark the file system as corrupted and needing fsck. But plenty of corruption is silent - how can we figure out which chunk was silently corrupted? We can't, but we can quietly "scrub" chunks by fscking them in the background. Currently, ext2/3/4 triggers a paranoia fsck every N mounts or M days since the last check. Unfortunately, this introduces an unexpected delay of minutes or hours at boot time; if you're lucky, you can go have a cup of coffee while it finishes; if you're unlucky, you'll be figuring out how to disable it while everyone who has come to your talk about improvements in Linux usability watches you. (N.B.: Reboot and add "fastboot" to the kernel command line.) With chunkfs, we could run fsck on a few chunks at every boot, adding a few seconds to each boot but avoiding occasional long delays. We could also fsck inactive chunks in the background while the file system is in use.
Results
I gave a talk at LCA in January 2008 about chunkfs which included a live demo of the final chunkfs prototype in action: creating files, continuing them into the next chunk, deliberately damaging a chunk, and repairing the file system. Because I carefully prepared a typescript-based fallback in case things went sideways, the demo went perfectly. A video of the talk [Ogg Theora] is available.
Note that I have never been able to bring myself to watch this talk, so I don't know if it contains any embarrassing mistakes. If you want, you can follow along during the demo section with the annotated output from the demo.
The collection-of-file-systems approach ended up being more complex than I had hoped. The prototype used ext2 as the per-chunk file system - a reasonable choice for some throw-away code, but not workable in production. However, once you switch to any reliable file system, you end up with one journal per chunk, which is quite a lot of overhead. In addition, it seemed likely we'd need another journal to recover from crashes during cross-chunk operations. Another source of complexity was the use of a layered file system approach - basically, the chunkfs layer pretends to be a client file system when it's talking to the VFS, and pretends to be the VFS when it's talking to the per-chunk file system. Since none of this is officially supported, we end up with a long list of hacks and workarounds, and it's easy to run into surprise reference counting or locking bugs. Overall, the approach worked for a prototype, but it didn't seem worth the investment it would take to make it production quality, especially with the advent of btrfs.
Future work
When I began working on chunkfs, the future of Linux file systems development looked bleak. The chunkfs in-kernel prototype looked like it might be the only practical way to get repair-driven design into a Linux file system. All that has changed; file system development is a popular, well-funded activity and Linux has a promising next-generation file system, btrfs, which implements many principles of repair-driven file system design, including checksums, magic numbers, and redundant metadata. Chunkfs has served its purpose as a demonstration of the power of the repair-driven design approach and I have no further development plans for it.
Conclusions
The three chunkfs prototypes and our estimates of cross-chunk references using real-world file systems showed that the chunkfs architecture works as advertised. The prototypes also convinced us that it would be difficult to retrofit existing journaling file systems to the chunkfs architecture. Features that make file system check and repair fast and reliable work best when designed into the architecture of the file system from the beginning. Btrfs is an example of a file system designed from the ground up with the goal of fast, reliable check and repair.
Credits
Many people and organizations generously contributed to the design and prototyping of chunkfs. Arjan van de Ven was the co-inventor of the original chunkfs architecture. Theodore Ts'o and Zach Brown gave invaluable advice and criticism while we were thrashing out the details. The participants of the Hot Topics in Dependability Workshop gave us valuable feedback and encouragement. Karuna Sagar wrote the cross-reference measuring tool that gave us the confidence to go forward with an implementation. Amit Gud wrote two prototypes while a graduate student at Kansas State University. The development of the third prototype was funded at various points by Intel, EMC, and VAH Consulting. Finally, too many people to list made valuable contributions during discussions about chunkfs on mailing lists, and I greatly appreciate their help.
Avoiding a read-only filesystem on errors
Errors in the storage subsystem can happen for a variety of known or unknown reasons. A filesystem turns itself read-only when it encounters errors in the storage subsystem, or when it hits a code path that should never be taken (i.e. a BUG() path). Making the filesystem read-only is a safeguard that filesystems implement to avoid further damage after errors are encountered. Filesystems that change to read-only end up stalling services that rely on data writes and, in some cases, may leave the system unresponsive. In embedded devices, applications dying because of a filesystem turning read-only may render the device useless, leaving the user confused.
Filesystem errors can either be recoverable or non-recoverable. Errors from bad block links in the inode data structures or block pointers can be recovered by filesystem checks. On the other hand, errors due to memory corruption, or a programming error, might not be recoverable because one cannot be sure which data is correct.
Denis Karpov proposed a patchset that would notify user space, so that a user-space policy can be defined to take the appropriate action to avoid turning the filesystem read-only. The patchset is currently limited to FAT filesystems. User-space notifications would allow a filesystem to continue to be used after encountering "correctable errors" if proactive measures are taken to correct them. Such corrective actions can obviate the need for lengthy filesystem checks when the device is mounted next.
The patchset adds a /sys/fs/fat/<volume>/fs_fault file which is initialized to 0 when the filesystem is mounted. The value of fs_fault changes to 1 on error. A user-space program can poll() on this file to detect that the value has changed, which indicates that an error has occurred. Besides the file polling, a KOBJ_CHANGE uevent is generated, with the uevent's environment variable FS_UNCLEAN set to 0 or 1. A udev rule can then trigger the correct program to take care of the damage done by the error.
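A minimal sketch of the user-space side, assuming the usual sysfs poll protocol (open the attribute, do an initial read to arm it, then wait for POLLPRI); the volume name is a made-up example:

```c
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4];
	int fd = open("/sys/fs/fat/mmcblk0p1/fs_fault", O_RDONLY);
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };

	if (fd < 0)
		return 1;
	read(fd, buf, sizeof(buf));	/* initial read arms the poll */
	poll(&pfd, 1, -1);		/* sleeps until the attribute changes */
	lseek(fd, 0, SEEK_SET);
	read(fd, buf, sizeof(buf));	/* re-read: should now be "1" */
	printf("fs_fault = %c\n", buf[0]);
	close(fd);
	return 0;
}
```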
Kay Sievers is not convinced by the idea of using uevents for user-space notifications: uevents are designed for device notifications, so they do not fit the design goals of reporting filesystem errors. Filesystem errors are quite often a repeated series of events; a read failure, for example, may print multiple read errors in dmesg, one for each block that could not be read. An event generated for each block may be too much for udev to handle. Some of the events may get lost or, worse still, may cause udev to ignore uevents from other devices which occurred during the burst of errors.
In his words: "They could be used to get attention when a superblock does a one-time transition from 'clean' to 'error'; everything else would just get us into serious trouble later."
Keeping <volume>/fs_fault in sysfs is also not the best solution, because sysfs is unaware of filesystem namespaces. The primary responsibility of sysfs is to expose core kernel objects. Filesystem namespaces are a set of filesystem mounts that are only visible to a particular process and may be invisible to the rest of the processes.
A process with a private namespace holds its own copy of the namespace instead of sharing one. When the system starts, it contains one global namespace shared among all processes. Mounts and unmounts in a shared namespace are seen by all the processes in that namespace. A process receives a private namespace if it was created by the clone() system call with the CLONE_NEWNS flag ("new namespace"). A process in a shared namespace can also create a private one by calling unshare() with the CLONE_NEWNS flag, as sketched below. Mounts and unmounts within a private namespace are seen only by the process that created the namespace. A child process created by fork() shares its parent's namespace.
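As a concrete sketch of the unshare() path (error handling trimmed; the mount target is a hypothetical path):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Split off a private copy of the mount namespace. */
	if (unshare(CLONE_NEWNS) < 0) {
		perror("unshare");
		return 1;
	}
	/*
	 * From here on, mounts are visible only inside this process's
	 * namespace; /mnt/private is an invented example directory.
	 */
	if (mount("none", "/mnt/private", "tmpfs", 0, NULL) < 0)
		perror("mount");
	return 0;
}
```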
Because of this, processes might receive errors on a filesystem in a different namespace, so they would not know which device to act on. The problem is also noticeable with processes accessing bind mounts created in a different namespace (bind mounts are a feature in which a sub-tree of a filesystem can be mounted on another directory). Moreover, filesystems spanning multiple devices, such as btrfs, would not be able to report the device name under the current naming structure.
Kay recommends /proc/self/mountinfo as a better alternative, because it contains the list of mount points in the namespace of the process reading it (self). Currently, /proc/self/mountinfo changes when the mount tree changes; this could be extended to propagate errors to user space in the correct namespace using poll(), regardless of the device name. /proc/self/mountinfo can accommodate optional fields which hold values of the form "tag[:value]" that could communicate the nature of the problem. Instead of using the existing udev infrastructure, this would require a dedicated application to monitor /proc/self/mountinfo, identify the error by parsing the optional fields, and act accordingly (a sketch follows).
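For illustration, a mountinfo line looks like "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw", with the optional fields sitting between the per-mount options and the " - " separator. A monitoring daemon might scan them along these lines; the "fault" tag is invented here, standing in for whatever tag such a scheme would actually define:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[1024];
	FILE *f = fopen("/proc/self/mountinfo", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		char *tok, *save;
		int field = 0;

		for (tok = strtok_r(line, " \n", &save); tok;
		     tok = strtok_r(NULL, " \n", &save)) {
			if (++field <= 6)
				continue;	/* six fixed fields first */
			if (strcmp(tok, "-") == 0)
				break;		/* end of optional fields */
			/* hypothetical "fault[:value]" error tag */
			if (strncmp(tok, "fault", 5) == 0)
				printf("error reported: %s\n", tok);
		}
	}
	fclose(f);
	return 0;
}
```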
Jan Kara further suggested using /proc/self/mountinfo as the place to identify the filesystem device generating the errors.
Despite these objections, everyone agrees that error reporting to user space must not be limited to dmesg messages. User space should be notified of the errors reported by the filesystem, so that it can proactively handle them and try to correct them. The namespace-unaware sysfs filesystem and notification through uevents may not be the best solution but, for lack of a better alternative interface, the developers used sysfs and uevents. The design is still under discussion and will take some time to evolve, though it is likely that some kind of solution to this problem will make its way into the kernel.
Linux kernel design patterns - part 2
Last week we discussed the value of enunciating kernel design patterns and looked at the design patterns surrounding reference counts. This week we will look at a very different aspect of coding and see why the kernel has special needs, and how those needs have been addressed by successful approaches. The topic under the microscope today is complex data structures.

By "complex data structures" we mean objects that are composed of a variable number of simpler objects, possibly a homogeneous collection, possibly a heterogeneous collection. In general it is a structure to which objects can be added, from which objects can be removed, and in which objects can be found. The preferred way to work with such data structures when we study them in an introductory programming course is to use Abstract Data Types.
Abstract Data Types
The idea behind Abstract Data Types is to encapsulate the entire implementation of a data structure, and provide just a well defined interface for manipulating it. The benefit of this approach is that it provides a clean separation. The data type can be implemented with no knowledge of the application which might end up using it, and the application can be implemented with no knowledge of the implementation details of the data type. Both sides simply write code based on the interface which works like a contract to explicitly state what is required and what can be expected.
On the other hand, one of the costs of this approach is tightly connected with the abstract nature of the interface. The point of abstracting an interface is to remove unnecessary details. This is good for introductory computing students, but is bad for kernel programmers. In kernel programming, performance is very important, coming as a close third after correctness and maintainability, and sometimes taking precedence over maintainability. Not all code paths in the kernel are performance-critical, but many are, and the development process benefits from the same data structures being used in both performance-critical and less critical paths. So it is essential that data types are not overly abstracted, but that all details of the implementation are visible so that the programmer can make optimal choices when using them.
So the first principle of data structures in the kernel is not to hide detail. To see how this applies, and to discover further principles from which to extract patterns, we will explore a few of the more successful data types used in the Linux kernel.
Linked Lists
Starting simply, the first data type we will explore is the doubly linked list. It is implemented by a single include file, <linux/list.h>. There is no separate ".c" file with any library of support routines. All of the code for handling linked lists is simple enough to be implemented using inline functions. Thus it is very easy to use this implementation in any other (GPLv2-licensed) project.
There are two aspects of the "list.h" lists which are worth noting as they point to possible patterns. The first is struct list_head, which serves not only as the head of a list, but also as the anchor in items that are on a list. Your author has seen other linked list implementations which required that the first and second element in any data structures meant to be stored in lists be the "next" and "previous" pointers, so that common list-walking code could be used on a variety of different data structures. Linux kernel lists do not suffer from this restriction. The list_head structure can be embedded anywhere in a data structure, and the list_heads from a number of instances of that structure can be linked together. The containing structure can be found from a ->next or ->prev pointer using the list_entry() macro.
There are at least two benefits of this approach. One is that the programmer still has full control of placement of fields in the structure in case they need to put important fields close together to improve cache utilization. The other is that a structure can easily be on two or more lists quite independently, simply by having multiple struct list_head fields.
This practice of embedding one structure inside another and using container_of() (which is the general form of list_entry()) to get the parent from the child structure is quite common in the Linux kernel and is somewhat reminiscent of object oriented programming. The container is like a subtype of the embedded structure.
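A minimal sketch of this style (the widget structure and its lists are invented for illustration):

```c
#include <linux/list.h>

struct widget {				/* hypothetical */
	int id;
	struct list_head free_list;	/* linkage on the free list */
	struct list_head hash_list;	/* independent linkage on a hash chain */
};

static LIST_HEAD(free_widgets);

static struct widget *first_free_widget(void)
{
	if (list_empty(&free_widgets))
		return NULL;
	/* Map the embedded anchor back to its containing structure. */
	return list_entry(free_widgets.next, struct widget, free_list);
}
```

Because each widget carries two independent anchors, it can sit on the free list and on a hash chain at the same time.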
The other noteworthy aspect of list.h is the proliferation of "for_each" macros - the macros that make it easy to walk along a list looking at each item in turn. There are 20 of them (and that isn't counting the four more in rculist.h which I'll choose to ignore in the hope of brevity).
There are a few different reasons for this. The simplest are that
- We sometimes want to walk the list in the "reverse"
direction (following the "prev" link). There are five macros that go
backward, and 15 that go forward.
- We sometimes want to start in the middle of a list and
"continue" on from there, so we have four "continue" macros and
three "from" macros which interpret that starting point slightly
differently.
- We sometimes want to work with the struct list_head embedded in the target structure, but often we really want to use the list_entry() macro to get the enclosing structure; we find it easiest if the "for_each" macro does that for us. This provides the "entry" versions of the "for_each" macro, of which there are 13 (more than half).
Getting to the more subtle reasons, we sometimes want to be able to delete the "current" item without upsetting the walk through the list. This requires that a copy of the "next" pointer be taken before providing "this" entry to be acted upon, thus yielding the eight "safe" macros. An "ADT" style implementation of linked lists would quite likely only provide "safe" versions so as to hide these details. However kernel programmers don't want to waste the storage or effort for that extra step in the common case where it isn't needed.
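Continuing the hypothetical widget sketch from above, the difference shows up when deleting during a walk:

```c
#include <linux/slab.h>		/* kfree() */

static void prune_free_widgets(void)
{
	struct widget *w, *tmp;

	/*
	 * The "safe" variant caches the next pointer in tmp before the
	 * body runs, so the current entry may be unlinked and freed.
	 */
	list_for_each_entry_safe(w, tmp, &free_widgets, free_list) {
		if (w->id < 0) {
			list_del(&w->free_list);
			kfree(w);
		}
	}
}
```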
Then there is the fact that we actually have two subtly different types of linked lists. Regular linked lists use struct list_head as the head of the list. This structure contains a pointer to the start and to the end. In some use cases, finding the end of the list is not needed, and being able to halve the size of the head of the list is very valuable. One typical use case of that kind is a hash table where all these heads need to go in an array. To meet this need, we have the hlist, which is very similar to the regular list, except that only one pointer is needed in struct hlist_head. This accounts for six of the different "for_each" macros.
If we had every possible combination of forward or reverse, continue or not, entry or not, safe or not, and regular or hlist, we would have 32 different macros. In fact, only 19 of these appear to have been needed and, thus, coded. We certainly could code the remaining 13, but as having code that is never used tends to be frowned upon, it hasn't happened.
The observant reader will have noticed a small discrepancy in some of the above numbers. Of the 20 macros, there is one that doesn't fit the above patterns, and it drives home the point that was made earlier about kernel programmers valuing performance. This final "for_each" macro is __list_for_each(). All of the other macros use the "prefetch" function to suggest that the CPU starts fetching the ->next pointer at the start of each iteration so that it will already be available in cache when the next iteration starts (though the "safe" macros actually fetch it rather than prefetch it). While this will normally improve performance, there are cases when it will slow things down. When the walk of the list will almost always abort very early - usually only considering the first item - the prefetch will often be wasted effort. In these cases (currently all in the networking code) the __list_for_each() macro is available. It does not prefetch anything. Thus people having very strict performance goals can have a better chance of getting the performance they want.
So from this simple data structure we can see two valuable patterns that are worth following.
- Embedded Anchor: A good way to include generic objects in a data
structure is to embed an anchor in them and build the data structure
around the anchors. The object can be found from the anchor using
container_of().
- Broad Interfaces: Don't fall for the trap of thinking that "one size fits all". While having 20 or more macros that all do much the same thing is uncommon, it can be a very appropriate way of dealing with the complexity of finding the optimal solution. Trying to squeeze all possibilities into one narrow interface can be inefficient and choosing not to provide for all possibilities is counter-productive. Having all the permutations available encourages developers to use the right tool for the job and not to compromise. In 2.6.30-rc4, there are nearly 3000 uses of list_for_each_entry(), about 1000 of list_for_each_entry_safe(), nearly 500 of list_for_each(), and less than 1000 of all the rest put together. The fact that some are used rarely in no way reduces their importance.
RB-trees
Our next data structure is the RB-Tree or red-black tree. This is a semi-balanced, binary search tree that generally provides order "log(n)" search, insert, and delete operations. It is implemented in <linux/rbtree.h> and lib/rbtree.c. It has strong similarities to the list.h lists in that it embeds an anchor (struct rb_node) in each data structure and builds the tree from those.
The interesting thing to note about rbtree is that there is no search function. Searching an rbtree is really a very simple operation and can be implemented in just a few lines as shown by the examples at the top of rbtree.h. A search function certainly could be written, but the developer chose not to. The main reason, which should come as no surprise, is performance. To write a search function, you need to pass the "compare" function into that search function. To do that in C, you would need to pass a function pointer. As compare functions are often very simple, the cost of following the function pointer and making the function call would often swamp the cost of doing the comparison itself. It turns out that having the whole search operation compiled as one function makes for more efficient code. The same performance could possibly be achieved using inline functions or macros, but given that the search function itself is so short, it hardly seems worthwhile.
Note also that rbtree doesn't exactly provide an insert function either. Rather, the developer needs to code a search; if the search fails, the new node must be inserted at the point where it was found not to exist and the tree must be rebalanced. There are functions for this final insertion and rebalancing as they are certainly complex enough to deserve separate functions.
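A sketch of what this looks like in practice, closely modeled on the example at the top of rbtree.h (the mynode structure and function names are illustrative):

```c
#include <linux/rbtree.h>

struct mynode {				/* hypothetical */
	struct rb_node node;
	unsigned long key;
};

/* Open-coded search: the compare is inlined, no function pointer. */
static struct mynode *my_search(struct rb_root *root, unsigned long key)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct mynode *cur = rb_entry(n, struct mynode, node);

		if (key < cur->key)
			n = n->rb_left;
		else if (key > cur->key)
			n = n->rb_right;
		else
			return cur;
	}
	return NULL;
}

/* Insert repeats the search to find the link point, then lets the
 * library splice the node in and rebalance. */
static void my_insert(struct rb_root *root, struct mynode *new)
{
	struct rb_node **link = &root->rb_node, *parent = NULL;

	while (*link) {
		struct mynode *cur = rb_entry(*link, struct mynode, node);

		parent = *link;
		if (new->key < cur->key)
			link = &(*link)->rb_left;
		else			/* duplicates go right, for simplicity */
			link = &(*link)->rb_right;
	}
	rb_link_node(&new->node, parent, link);
	rb_insert_color(&new->node, root);
}
```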
By giving the developer the responsibility for search and for part of insert, the rbtree library actually gives a lot of valuable freedom. The pattern of "search for an entry, but if it isn't there, insert one" is fairly common. However, the details of what happens between the "search" and "add" phases are not always the same, and so not something that can easily be encoded in a library. By providing the basic tools and leaving the details up to the specific situation, users of rbtree find themselves liberated, rather than finding themselves fighting with a library that almost-but-not-quite does what they want.
So this example of rbtrees reinforces the "embedded anchors" pattern and suggests a pattern that providing tools is sometimes much more useful than providing complete solutions. In this case, the base data structures and the tools required for insert, remove, and rebalance are provided, but the complete solution can still be tailored to suit each case.
This pattern also describes the kernel's approach to hash tables. These are a very common data structure, but there is nothing that looks even vaguely like a definitive implementation. Rather the basic building blocks of the hlist and the array are available along with some generic functions for calculating a hash (<linux/hash.h>). Connecting these to fit a given purpose is up to the developer.
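A sketch of the tool-box approach applied to a hash table (all names invented; note that hlist_for_each_entry() took an extra scratch-cursor argument in kernels of this era):

```c
#include <linux/hash.h>
#include <linux/list.h>

#define MYHASH_BITS	7		/* 128 buckets; sizing is up to us */

static struct hlist_head my_table[1 << MYHASH_BITS];

struct my_item {			/* hypothetical */
	struct hlist_node hash;
	unsigned long key;
};

static void my_hash_add(struct my_item *item)
{
	/* hash_long() from <linux/hash.h> folds the key into an index. */
	hlist_add_head(&item->hash,
		       &my_table[hash_long(item->key, MYHASH_BITS)]);
}

static struct my_item *my_hash_find(unsigned long key)
{
	struct my_item *item;
	struct hlist_node *pos;		/* scratch cursor (2.6-era signature) */

	hlist_for_each_entry(item, pos,
			     &my_table[hash_long(key, MYHASH_BITS)], hash)
		if (item->key == key)
			return item;
	return NULL;
}
```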
So we have another pattern:
- Tool Box: Sometimes it is best not to provide a complete solution for a generic service, but rather to provide a suite of tools that can be used to build custom solutions.
Radix tree
Our last data structure is the Radix tree. The Linux kernel actually has two radix tree implementations. One is in <linux/idr.h> and lib/idr.c, the other in <linux/radix-tree.h> and lib/radix-tree.c. Both provide a mapping from a number (unsigned long) to an arbitrary pointer (void *), though radix-tree also allows up to two "tags" to be stored with each entry. For the purposes of this article we will only be looking at one of the implementations (the one your author is most familiar with) - the radix-tree implementation.
Radix-tree follows the pattern we saw in list.h of having multiple interfaces rather than trying to pack lots of different needs into the one interface. list.h has 20 "for_each" macros; radix-tree has six "lookup" functions, depending on whether we want just one item or a range (gang lookups), or whether we want to use the tags to restrict the search (tag lookups) and whether we want to find the place where the pointer is stored, rather than the pointer that is stored there (this is needed for the subtle locking strategies of the page cache).
However radix-tree does not follow the embedded anchor pattern of the earlier data structures, and that is why it is interesting. For lists and rbtrees, the storage needed for managing the data structure is exactly proportional to the number of items in the data structure, on a one-to-one basis, so keeping this storage in those items works perfectly. For a radix-tree, the storage needed is a number of little arrays, each of which refers to a number of items. So embedding these arrays, one each, in the items cannot work. This means that, unlike list.h and rbtree, radix-tree will sometimes need to allocate some memory in order to be able to add items to the data structure. This has some interesting consequences.
In the previous data structures (lists and rbtrees), we made no mention of locking. If locking is needed, then the user of the data structure is likely to know the specific needs, so all locking details are left up to the caller. We call that "caller locking", as opposed to "callee locking"; caller locking is more common and generally preferred. This is fine for lists and rbtrees, as nothing that they do internally is particularly affected by locking.
This is not the case if memory allocation is needed, though. If a process needs to allocate memory, it is possible that it will need to sleep while the memory management subsystem writes data out to storage to make memory available. There are various locks (such as spinlocks) that may not be held while a process sleeps. So there is the possibility for significant interaction between the need to allocate memory internally to the radix-tree code, and the need to hold locks outside the radix-tree code.
The obvious solution to this problem (once the problem is understood) is to preallocate the maximum amount of memory needed before taking the lock. This is implemented within radix-tree with the radix_tree_preload() function. It manages a per-CPU pool of available radix_tree nodes and makes sure the pool is full and will not be used by any other radix-tree operation. Thus, bracketing radix_tree_insert() with calls to radix_tree_preload() and radix_tree_preload_end() ensures that the radix_tree_insert() call will not fail due to lack of memory (though the radix_tree_preload() might) and that there will be no unpleasant interactions with locking.
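The resulting usage pattern looks like this (a sketch; the tree, lock, and function are placeholders):

```c
#include <linux/radix-tree.h>
#include <linux/spinlock.h>

static RADIX_TREE(my_tree, GFP_ATOMIC);	/* gfp mask for non-preloaded inserts */
static DEFINE_SPINLOCK(my_lock);

static int my_store(unsigned long index, void *item)
{
	int err;

	/*
	 * Fill the per-CPU node pool; this may sleep, so it must happen
	 * before the spinlock is taken.  Preemption stays disabled from
	 * here until radix_tree_preload_end().
	 */
	err = radix_tree_preload(GFP_KERNEL);
	if (err)
		return err;

	spin_lock(&my_lock);
	/* Cannot fail for lack of memory: nodes come from the pool. */
	err = radix_tree_insert(&my_tree, index, item);
	spin_unlock(&my_lock);

	radix_tree_preload_end();
	return err;
}
```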
Summary
So we can now summarize the design patterns that we have found to work well with data structures (and elsewhere) in the kernel. Those that have already been detailed are briefly repeated in this list for completeness.
- Embedded Anchor. This is very useful for lists, and can be
generalized as can be seen if you explore kobjects (an exercise
left to the reader).
- Broad Interfaces. This reminds us that trying to squeeze lots of
use-cases in to one function call is not necessary - just
provide lots of function calls (with helpful and (hopefully)
consistent names).
- Tool Box.
Sometimes it is best not to provide a complete solution for a
generic service, but rather to provide a suite of tools that can be
used to build custom solutions.
- Caller Locks. When there is any doubt, choose to have the caller
take locks rather than the callee. This puts more control in
the hands of the client of a function.
- Preallocate Outside Locks. This is in some ways fairly obvious. But it is very widely used within the kernel, so stating it explicitly is a good idea.
Next week we will complete our investigation of kernel design patterns by taking a higher level view and look at some patterns relating to the design of whole subsystems.
Exercises
For those who would like to explore these ideas further, here are some starting points.
- Make a list of all data structures that are embedded, by exploring
all uses of the "container_of" macro. Of these, make a list of
pairs that are both embedded in the same structure (modeling multiple
inheritance). Comment on how this reflects on the general
usefulness of multiple inheritance.
- Write an implementation of skiplists
that would be suitable for in-kernel use. Consider the
applicability of each of the patterns discussed in this article.
Extra credit for leveraging list.h lists.
- Linux contains a mempool library which radix-tree chooses not to
use, preferring its own simple pool (in radix_tree_preload).
Examine the consequences of changing radix-tree to use mempool, and
of changing mempool to be usable by radix-tree.
Provide either patches to change this design choice in radix-tree, or a
pattern that explains it.
- Compare the radix-tree and idr implementations to see if one could be implemented using the other without sacrificing correctness, maintainability or performance. Provide either an explanation of why they should stay separate, or a patch to replace one with the other.
Page editor: Jake Edge