Back in 2001, I landed my (then) dream job as a full-time Linux kernel
developer and distribution maintainer for a small embedded systems company.
I was thrilled - and horrified. I'd only been working as a programmer
for a couple of years and I was sure it was only a matter of time
before my new employer figured out they'd hired an idiot. The only
solution was to learn more about operating systems, and quickly. So I
pulled out my favorite operating systems textbook and read and re-read
it obsessively over the course of the next year. It worked well
enough that my company tried very hard to convince me not to quit when
I got bored with my "dream job" and left to work at Sun.
That operating systems textbook was
Internals by Uresh Vahalia. UNIX Internals is a careful, detailed
examination of multiple UNIX implementations as they evolved over
time, from the perspective of both the academic theorist and the
practical kernel developer. What makes this book particularly
valuable to the practicing operating systems developer is that the
review of each operating systems concept - say, processes and threads
- is accompanied by descriptions of specific implementations and their
histories - say, threading in Solaris, Mach, and Digital UNIX. Each
implementation is then compared on a number of practical levels,
including performance, effect on programming interfaces, portability,
and long-term maintenance burden - factors that Linux developers care
passionately about, but are seldom considered in the academic
operating systems literature.
UNIX Internals was published in 1996. A valid question is whether a
book on the implementation details of UNIX operating systems published
so long ago is still useful today. For example, Linux is only
mentioned briefly in the introduction, and many of the UNIX variants
described are now defunct. It is true that UNIX Internals holds
relatively little value for the developer actively staying up to date
with the latest research and development in a particular area.
However, my personal experience has been that many of the problems
facing today's Linux developers are described in this book - and so
are many of the proposed solutions, complete with the unsolved
implementation problems. More importantly, the analysis is often
detailed enough that it describes exactly the changes needed to
improve the technique, if only anyone took the time to implement them.
In the rest of this review, we'll cover two chapters of UNIX Internals
in detail, "Kernel Memory Allocation" and "File System
Implementations." The chapter on kernel memory allocation is an
example of the historical, cross-platform review and analysis that
sets this book apart, covering eight popular allocators from several
different flavors of UNIX. The chapter on file system implementations
shows how lessons learned from the oldest and most basic file system
implementations can be useful when solving the latest and hottest file
system design problems.
Kernel Memory Allocation
The kernel memory allocator (KMA) is one of the most
performance-critical kernel subsystems. A poor KMA implementation
will hurt performance in every code path that needs to allocate or free
memory. Worse, it will fragment and waste precious kernel memory -
memory that can't be easily freed or paged out - and pollute hardware
caches with instructions and data used for allocation management.
Historically, a KMA was considered pretty good if it only wasted 50%
of the total memory allocated by the kernel.
Vahalia begins with a short conceptual description of kernel memory
allocation and then immediately dives into practical implementation,
starting with page-level allocation in BSD. Next, he describes memory
allocation in the very earliest UNIX systems: a collection of
fixed-size tables for structures like inodes and process table
entries, occasional "borrowing" of blocks from the buffer cache, and a
few subsystem-specific ad hoc allocators. This primitive approach
required a great deal of tuning, wasted a lot of memory, and made the
What constitutes a good KMA? After a quick review of the functional
requirements, Vahalia lays out the criteria he'll use to judge the
allocators: low waste (fragmentation), good performance, simple
interface appropriate for many different users, good alignment,
efficient under changing workloads, reassignment of memory allocated
for one buffer size to another, and integration with the paging
system. He also takes into consideration more subtle points, such as
the cache and TLB footprint of the KMA's code, along with cache and
lock contention in multi-processor systems.
This is an
example of how even the oldest and clunkiest algorithms can influence the
design of the latest and greatest.
The first KMA reviewed is the resource map allocator, an extremely
simple allocator using a list of <base, size> pairs
describing each free segment of memory, sorted by base address. The
charms of the resource map allocator include simplicity and allocation
of exactly the size requested; the vices include high fragmentation
and poor performance under nearly every workload. Even this
allocation algorithm is useful under the right circumstances; Vahalia
describes several subsystems that still use it (System V semaphore
allocation and management of free space in directory blocks on some
systems) and some minor tweaks that improve the algorithm. One tweak
to the resource map allocator keeps the description of each free
region in the first few bytes of the region, a technique later used in
the state-of-the-art SLUB allocator in the Linux kernel. This is an
example of how even the oldest and clunkiest algorithms can influence the
design of the latest and greatest.
Each following KMA is discussed in terms of the problems it solves
from previous allocators, along with the problems it introduces. The
resource map's sorted list of base/size pairs is followed by
power-of-two free lists with a one-word in-buffer header (better
performance, low external fragmentation, but high internal
fragmentation, esp. for exact power-of-two allocations), the
McKusick-Karels allocator (power-of-two free lists optimized for
power-of-two allocation; extremely fast, but prone to external
fragmentation), the buddy allocator (buffer splitting on power-of-two
boundaries plus coalescing of adjacent free buffers; poor performance
due to unnecessary splitting and coalescing), and the lazy buddy
allocator (buddy plus delayed buffer coalescing; good steady-state
performance but unpredictable under changing workloads). The
accompanying diagrams of the data structures and buffers used to
implement each allocator are particularly helpful in understanding the
structure of the allocators.
After covering the simpler KMAs, we get into more interesting
territory: the zone allocator from Mach, the hierarchical allocator
from Dynix, and the SLAB allocator, originally implemented on Solaris
and later adopted by several UNIXes, including Linux and the BSDs.
Mach's zone allocator is the only fully garbage-collected KMA studied,
with the concomitant unpredictable system-wide performance slowdowns
during garbage collection, which would strike it from most developers'
lists of useful KMAs. But as with the resource map allocator, we
still have lessons to learn from the zone allocator. Many of the
features of the zone allocator also appear in the SLAB allocator,
commonly considered the current best-of-breed KMA.
The zone allocator creates a "zone" of memory reserved for each class
of object allocated (e.g., inodes), similar to kmem caches in the
later SLAB allocator. Pages are allocated to a zone as needed, up to
a limit set at zone allocation time. Objects are packed tightly
within each zone, even across pages, for very low internal
fragmentation. Anonymous power-of-two zones are also available. Each
zone has its own free list and once a zone is set up, allocation and
freeing simply add and remove items from the per-zone free list (free
list structures are also allocated from a zone). Memory is reclaimed
on a per-page basis by the garbage collector, which runs as part of
the swapper task. It uses a two-pass algorithm: the first pass counts
up the number of free objects in each page, and the second pass frees
empty pages. Overall, the zone allocator was a major improvement on
previous KMAs: fast, space efficient, and easy to use, marred only by
the inefficient and unpredictable garbage collection algorithm.
The next KMA on the list is the hierarchical memory allocator for
Dynix, which ran on the highly parallel Sequent S2000. One of the
major designers and implementers is our own Paul McKenney, familiar to
many LWN readers as the progenitor of the read-copy-update (RCU)
system used in many places in the Linux kernel. The goal of the Dynix allocator was
efficient parallel memory allocation, in particular avoiding lock
contention between processors. The solution was to create several
layers in the memory allocation system, with per-cpu caches at the
bottom and collections of large free segments at the top. As memory
is freed or allocated, regions move up and down one level of the
hierarchy in batches. For example, each per-cpu cache has two free
lists, one in active use and the other in reserve. When the active
list runs out of free buffers, the free buffers from the reserve list
are moved onto it, and the reserve list replenishes itself with
buffers from the global list. All the work requiring synchronization
between multiple CPUs happens in one big transaction, rather than
incurring synchronization overhead on each buffer allocation.
The Dynix allocator was a major advance: 3 - 5 times faster than the
BSD allocator even on a single CPU. Its memory reclamation system was
far more efficient than the zone allocator's, performed on an on-going
basis with bounded worst case performance on each operation.
Performance on SMP systems was unparalleled.
The final KMA in this chapter is the SLAB allocator, initially
implemented on Solaris and later re-implemented on Linux and BSD. The
SLAB allocator refined some existing techniques (simple allocation/free
computations for small cache footprint, per-object caches) and
introduced several new ones (cache coloring, efficient object reuse).
The result is an allocator that was both the best performing and the
most efficient by a wide margin - only 14% fragmentation versus 27%
for the SunOS 4.1.3 sequential-fit allocator, 45% for the 4.4BSD
McKusick-Karel allocator, and 46% for the SunOS 5.x buddy allocator.
Like the zone allocator, SLAB allocates per-object caches (along with
anonymous caches in useful sizes) called kmem caches. Each cache has
an associated optional constructor and destructor function run on the
objects in a newly allocated and newly freed page, respectively (though the
destructor has since been removed in the Linux allocator).
Each cache is a doubly-linked list of slabs - large contiguous chunks
of memory. Each slab keeps its slab data structure at the end of the
slab, and divides the rest of the space into objects. Any leftover
free space in the slab is divided between the beginning and end of the
objects in order to vary the offset of objects with respect to the CPU
cache, improving cache utilization (in other words, cache coloring).
Each object has an associated 4-byte free list pointer.
The slabs within each kmem cache are in a doubly linked list, sorted
so that free slabs are located at one end, fully allocated slabs at
the other, and partially allocated slabs in the middle. Allocations
always come from partially allocated slabs before touching free
slabs. Freeing an object is simple: since slabs are always the same
size and alignment, the base address of the slab can be calculated
from the address of the object being freed. This address is used to
find the slab on the doubly linked list. Free counts are maintained
on an on-going basis. When memory pressure occurs, the slab allocator
walks the kmem caches freeing the free slabs at the end of the cache's
slab list. Slabs for larger objects are organized differently, with
the slab management structure allocated separately and additional
buffer management data included.
This section of UNIX Internals has aged particularly well, partly
because the SLAB allocator continues to work well on modern systems.
As Vahalia notes, the SLAB allocator initially lacked optimizations
for multi-processor systems, but these were added shortly afterward,
using many of the same techniques as the Dynix hierarchical allocator.
Since then, most production kernel memory allocators have been
SLAB-based. Recently, Christoph Lameter rewrote SLAB
to get the SLUB allocator for Linux; both are available as kernel
configuration options. (The third option, the SLOB allocator, is not
related to SLAB - it is a simple allocator optimized for small
embedded systems.) When viewed in isolation, the SLAB allocator may
appear arbitrary or over-complex; when viewed in the context of
previous memory allocators and their problems, the motivation behind
each design decision is intuitive and clear.
File Systems Implementations
UNIX Internals includes four chapters on file systems, covering the
user and kernel file system interface (VFS/vnode), implementations of
on-disk and in-memory file systems, distributed/network file systems,
and "advanced" file system topics - journaling, log-structured file
systems, etc. Despite the intervening years, these four chapters are
the most comprehensive and practical description of file systems
design and implementation I have yet seen. I definitely recommend it
File System Design and Implementation - a massive sprawling book
which lacks the focus and advanced implementation details of UNIX
The chapter on file systems implementations is too packed with useful
detail to review fully in this article, so I'll focus on the points
that are relevant to current hot file system design problems. The
chapter describes the System V File System (s5fs) and Berkeley Fast
File System (FFS) implementations in great detail, followed by a
survey of useful in-memory file systems, including tmpfs, procfs
(a.k.a. /proc file system), an early variant of a device file
system called specfs, and a sysfs-style interface for managing
processors. This chapter also covers the implementation of buffer
caches, inode caches, directory entry caches, etc. One of the
features of this chapter (as elsewhere in the book) is the carefully
chosen bibliography. Bibliographies in research papers serve a double
purpose as demonstrations of the authors' breadth of knowledge in the
area and tend to be cluttered with more marginal references; the
per-chapter bibliographies in UNIX Internals list only the most
relevant publications and make excellent supplementary reading guides.
System V File System (s5fs) evolved from the first UNIX file system. The
on-disk layout consisted of a boot block followed by a superblock
followed by a single monolithic inode table. The remainder of the
disk is used for data and indirect blocks. File data blocks are
located via a standard single/double/triple indirect block scheme.
s5fs has no block or inode allocation bitmaps; instead it maintains
on-disk free lists. The inode free list is partial; when no more free
inodes are on the list, it is replenished by scanning the inode table.
Free blocks are tracked in a singly linked list rooted in the
superblock - a truly terrifying design from the point of view of file
system repair, especially given the lack of backup superblocks.
In many respects, s5fs is simultaneously the simplest and the worst
UNIX file system possible: its throughput was commonly as little as 5%
of the raw disk bandwidth, it was easily corrupted, it had a 14
character limit on file names, and so on. On the other hand, elements
of the s5fs design have come back into vogue, often without addressing
the inherent drawbacks still unsolved in the intervening decades.
The most striking example of a new/old design principle illustrated by
s5fs is the placement of most of the metadata in one spot. This
turned out to be a key performance problem for s5fs, as every uncached
file read virtually guaranteed a disk seek of non-trivial magnitude
between the location of the metadata at the beginning of the disk and
the file data, located anywhere except the beginning of the disk. One
of the major advances of FFS was to distribute inodes and bitmaps
evenly across the disk and allocate associated file data and indirect
blocks nearby. Recently, collecting metadata in one place has
returned as a way to optimize file system check and repair time as
well as other metadata-intensive operations. It also appears in
designs that keep metadata on a separate high-performance device
(usually solid state storage).
The problems with these schemes are the same as the first time around.
For the fsck optimization case, most normal workloads will suffer from
the required seek for reads of file data from uncached inodes (in
particular, system boot time would suffer greatly). In the separate
metadata device case, the problem of keeping a single,
easily-corrupted copy of important metadata returns. Currently, most
solid-state storage is less reliable than disk, yet most proposals to
move file system metadata to solid state storage make no provision for
backup copies on disk.
Another cutting edge file system design issue first encountered in
s5fs is backup, restore, and general manipulation of sparse files.
System administrators quickly discovered that it was possible to
create a user-level backup that could not be restored because the
tools would attempt to actually write (and allocate) the zero-filled
unallocated portions of sparse files. Even more intelligent tools
that do not explicitly write zero-filled portions of files still had
to pointlessly copy pages of zeroes out of the kernel when reading
sparse files. In general, the file and socket I/O interface requires
a lot of ultimately unnecessary copying of file data into and out of
the kernel for common operations. It has only been in the last few
years that more sophisticated file system interfaces have been
proposed and implemented, including SEEK_HOLE/SEEK_DATA
and splice() and friends.
The chapters on file systems are definitely frustratingly out of date,
especially with regard to advances in on-disk file system design.
You'll find little or no discussion of copy-on-write file systems,
extents, btrees, or file system repair outside of the context of
non-journaled file systems. Unfortunately, I can't offer much in the
way of a follow-up reading list; most of the papers in my
file systems reading list
are covered in this book (exceptions include the papers on soft
updates, WAFL, and XFS). File systems developers seem to publish less
often than they used to; often the options for learning about the
cutting edge are reading the code, browsing the project wiki, and
attending presentations from the developers. Your next opportunity
for the latter is the Linux
Plumbers Conference, which has a number of file system-related
Another major flaw in the book, and one of the few places where
Vahalia was charmed by an on-going OS design fad, is the near-complete
lack of coverage of TCP/IP and other networking topics (the index
entry for TCP/IP lists only two pages!). Instead, we get an entire
chapter devoted to streams, at the time considered the obvious next
step in UNIX I/O. If you want to learn more about UNIX networking
design and implementation, this is the wrong book; buy some of the
Stevens and Comer networking books instead.
UNIX Internals was the original inspiration for the Kernel Hacker's
Bookshelf series, simply because you could always find it on the
bookshelf of every serious kernel hacker I knew. As the age of the
book is its most serious weakness, I originally intended to wait until
the planned second edition was released before reviewing it. To my
intense regret, the planned release date came and went and the second
edition now appears to have been canceled.
UNIX Internals is not the right operating systems book for everyone;
in particular, it is not a good textbook for an introductory operating
systems course (although I don't think I suffered too much from the
experience). However, UNIX Internals remains a valuable reference book
for the practicing kernel developer and a good starting point for
the aspiring kernel developer.
to post comments)