Brief items
The current 2.6 development kernel is 2.6.29-rc8,
released by Linus on
March 12. It has the usual set of fixes, and the patch rate does seem
to be slowing down. "
In fact, it seems to be stabilizing to the
point where I'm hoping that we're approaching a final 2.6.29, and this
might be the last -rc. We'll have to see." The short-form changelog
is in the announcement, or see
the
full changelog for all the details.
As of this writing, nearly 200 changesets have been merged since
the 2.6.29-rc8 release. They are mostly fixes, there is also a new mascot
logo, a driver for ServerEngines BladeEngine 2 adapters, and a driver
for the "Dave" Ethernet controller found on the Qong Board FPGA.
The current stable 2.6 kernel is 2.6.28.8, released on March 16; the
2.6.27.20 update was released
at the same time. Both contain an even longer than usual list of important
fixes.
Comments (none posted)
Kernel development news
That was added five or six years ago, and I never ever got to eat
my hat.
--
Andrew Morton
Finally, with a lot of delay, I've just released the first full
public version of my nftables code (including userspace), which is
intended to become a successor to iptables. Its written from
scratch and there are numerous differences to iptables in both
features and design...
--
Patrick McHardy
I'm really fed up with these discussions. I have seen almost _zero_
critical thinking at all. Probably because anybody who is in the
least doubtful about it simply has tuned out the discussion. So
here's my input: start small, start over, and start thinking about
other issues than just checkpointing.
--
Linus Torvalds seeks to restart the
checkpoint discussion
Comments (1 posted)
Here's
a posting from Ted Ts'o on ext4, delayed allocation, and robustness. "
Whats the best path forward? For now, what I would recommend to Ubuntu gamers whose systems crash all the time and who want to use ext4, to use the nodelalloc mount option. I havent quantified what the performance penalty will be for this mode of operation, but the performance will be better than ext3, and at least this way they wont have to worry about files getting lost as a result of delayed allocation. Long term, application writers who are worried about files getting lost on an unclean shutdown really should use fsync."
Comments (232 posted)
Here is
Matthew
Garrett's contribution to the ext4 debate. Not for anybody who is
offended by profanity. "
Dear filesystem writers - application
developers like writing lots of tiny files, because it makes a large number
of things significantly easier. This is fine because sheer filesystem
performance is not high on the list of priorities of a typical application
developer. The answer is not 'Oh, you should all use sqlite'. If the only
effective way to use your filesystem is to use a database instead, then
that indicates that you have not written a filesystem that is useful to
typical application developers who enjoy storing things in files rather
than binary blobs that end up with an entirely different set of
pathological behaviours."
Comments (164 posted)
![[Tuz]](/images/ns/kernel/tuz-logo.png)
As everybody knows, only important fixes will be merged into the mainline
kernel at this late stage of the development cycle. One of the fixes
merged by Linus on March 17 was a high-resolution SVG image of "Tuz,"
the mascot of the 2009 linux.conf.au conference. Tuz, in his new home at
Documentation/logo.svg, serves to remind the world of the
difficulties faced by the Tasmanian
devil and how the linux.conf.au attendees
supported the effort
to save this species from extinction.
Comments (15 posted)
March 18, 2009
This article was contributed by Valerie Aurora (formerly Henson)
A unioning file system combines the namespaces of two or more file
systems together to produce a single merged namespace. This is useful
for things like a live CD/DVD: you can union-mount a small, writable
file system on top of a read-only DVD file system and have a
usable system without needing to transfer the system files from the DVD to
the root file system. Another use is to export a single read-only
base file system via NFS to multiple clients, each with their own
small, writable overlay file system union mounted on top. Another
use might be a default set of files for a home directory.
Unioning file systems of various sorts have been around since at least
the Translucent File Service (or File System), written around 1988 for
SunOS. BSD has had union mounts since 4.4 BSD-Lite, around 1994, and
Plan 9 implemented union directories in a similar time frame. The
first prototype of a union-style file system for Linux was the
Inheriting File
System (IFS), written by Werner Almsberger for Linux 0.99 in 1993.
It was later abandoned, as Werner writes, because "I looked back at
what I did and was disgusted by its complexity." Later
implementations for Linux
include unionfs
(2003),
aufs (2006),
and union mounts (2004),
as well as FUSE prototypes and implementations of various forms. So
how is it that in 2009, Linux still does not have an in-tree unioning
file system?
The short answer is that none of the existing implementations meet the
kernel maintainers' standards of simplicity, clarity, and correctness.
What makes it so difficult to write a unioning file system that meets
these standards? In this week's article, we'll review general
unioning file system concepts, and common design problems and
solutions. In a subsequent article, we'll review the top contenders for a
mainline kernel unioning file system for Linux, as well as unioning
file systems for other operating systems.
Unioning file systems implementation guidelines
First, let's define what a useful unioning file system implementation
would look like. The basic desire is to have one read-only file
system, and some sort of writable overlay on top. The overlay must
persistently store changes and allow arbitrary manipulation of the
combined namespace (including persistent effective removal of parts of
the read-only namespace - "whiteouts"). The implementation should be
fast enough for use as a root file system, and should require little
or no modification of underlying file systems. It should be
implemented in the kernel; FUSE-based file systems have many uses but
aren't appropriate for many use cases (root file systems, embedded
systems, high (or even moderate) file system performance, etc.).
A successful unioning file system will be an improvement (in terms of
disk space used, cost of code maintenance, deployment time, etc.) over
the alternatives - copying over all the files up front, clever
division of file systems into multiple mount points, or writing an
entire new on-disk file system. If we gain all the features of a
unioning file system, but complicate the VFS code too much, we'll have
a union file system at the cost of a slow, unmaintainable, buggy VFS
implementation. If a union file system includes as much or more code
than the underlying file system implementations, you start to wonder
if supporting unioning in underlying file systems individually would
be more maintainable.
One alternative to unioning file systems is the copy-on-write block
device, often used for virtualized system images. A COW block device
works for many applications, but for some the overhead of a block
device is much higher than that of a unioning file system. Writes to
a file system on a COW block device will produce many duplicated
blocks as the bitmaps, journals, directory entries, etc. are written.
A change that could be encapsulated in a few bytes of directory entry
in a unioning file system could require several hundred kilobytes of
change at the block level. Worse, changes to a block device tend only
to grow; even with content-based block merging (not a common feature),
two file systems which are logically nearly identical may differ by
many megabytes at the block level. An important case is one in which
you delete many files; with a unioning file system this will decrease
the space needed to store differences. With a COW block device, used
space will increase.
One problem that should NOT be solved by unioning file systems (and
file systems in general) is source control. Modern source control
systems are quite good compared to the crud we had to live with back
in the 1990's. Source control software back then was so buggy and
slow that it actually seemed like a good idea to implement parts of it
in the kernel; indeed, some commercially successful source control
systems still in use today require explicit file system support.
Nowadays, many fast, reliable source control systems are available and
it is generally agreed that complex and policy-heavy software should
be implemented outside of the kernel. (Those who disagree are welcome
to write gitfs.)
General concepts
Most unioning file systems share some general concepts. This section
describes branches, whiteouts, opaque directories, and file copy up.
Branches are the various file systems that are unioned together.
Branch access policies can be read-only, writable, or more complex
variations which depend on the permissions of other branches.
Branches are ordered or stacked; usually the branch "on top" is the
writable branch and the branch "on the bottom" is read-only.
Branches can sometimes be re-ordered, removed, added, or their
permissions changed on the fly.
A commonly required feature is that when a particular directory entry
is deleted from a writable branch, that directory entry should never
appear again, even if it appears in a lower branch. That is, deleting
a file named "x" should result in no file named "x" in that directory
afterward, even if a file named "x" exists in a lower branch.
Usually this is implemented through a combination of whiteouts and
opaque directories. A whiteout is a directory entry that covers up
all entries of a particular name from lower branches. An opaque
directory does not allow any of the namespace from the lower branches
to show through from that point downwards in the namespace.
When a file on a read-only branch is altered, it must be copied up to
some writable branch. Copy up is usually triggered either when the
file is opened with write permissions, or when the first write to the
file's data or metadata occurs.
Common design problems
Unioning file systems often run into the same difficult design
problems. Often, these problems only have a few obvious options with
built-in tradeoffs, and unioning file systems can be characterized to
some degree by which set of tradeoffs they choose. In this section,
we'll review some of the top design problems and their common
solutions.
Namespace pollution:
Often, whiteouts, opaque directories, persistent inode numbers, and
any other persistent file system data are stored in specially-named
files. These files clutter up the available namespace and produce
unexpected behavior. Some minor namespace pollution has been
acceptable in UNIX as long as it is restricted to the top-level
directory (think "/lost+found"), but namespace pollution on a
per-directory or per-file basis is generally frowned on. Solutions to
this problem include various ways of shuffling around the namespace
pollution - moving it to a single subdirectory per directory or file
system or storing it in an external file system - or not creating
namespace pollution in the first place (which generally requires
support from the underlying file system).
Whiteouts and opaque directories:
While the concepts of whiteouts and opaque directories are fairly
general, the implementation varies. The most common option is to
create a directory entry with a special name, such as
".wh.<filename>" which indicates that the corresponding
directory entry should never be returned. This can cause clashes with
user file names and prevent stacking one union over another. It also
makes directory removal time-consuming. An "empty" directory can have
many thousands of whiteout entries which must be deleted before the
rmdir() can complete. Sometimes directory removals are pushed off to
a special work thread to improve rmdir() latency.
Another option is to create a special file or directory entry type to
mark whiteout directory entries, and give whiteout directory entries a
reserved inode number. The name in the directory entry itself is the
same as the name being whited out and does not appear in directory
listings. This form of whiteouts requires support from the underlying
file system to store the necessary flags and from the file system
utilities to accept the special entries. One more option is to make
whiteout entries be hard links to special whiteout files, or symbolic
links to reserved targets. The hard link solution requires handling
the case of exceeding the maximum link count on the target file.
Directories can be marked opaque with either a flag in the directory
inode (again requiring support from the underlying file system) or
with a directory entry with a reserved name, like ".opaque".
Timing of directory copies:
When a file is created on a writable branch in a directory that exists
only on another branch, the entire directory path to that file must be
present on the writable branch. The path can be created on-demand
when the target file is copied, or each directory may be preemptively
copied to the writable branch as it is opened. Creating paths on
demand introduces complexity, locking issues, and additional latency
on file writes. Creating paths on directory open simplifies code and
improves latency on file write, but uses up additional (often
negligible) space on the writable branch.
Kernel memory usage:
Unioning file systems often introduce new data structures, extra
copies of data structures, and a variety of cached metadata. For
example, a union of two file systems may require three VFS inodes to
be allocated for one logical object. The most common implementation
of whiteouts and duplicate entry removal requires reading all
directory entries from all branches into memory and constructing a
coherent view of the directory there. If this cache is maintained
across system calls, we have to worry about regenerating it when
underlying branches change. When it is not cached, we have to
reallocate memory and remerge the entries repeatedly. Either way, a
lot of kernel memory must be allocated.
Code cleanliness:
One of the main points of a unioning file system is to reuse existing
file system code, with the expected benefits in ease of maintenance
and future development in that code base. If the implementation is
excessively complex or intrusive, the net effect will be a reduction
in ease of maintenance and development. The sheer number and variety
of Linux file systems (on-disk, in-memory, and pseudo) demonstrates
the benefit of clean and simple file system interfaces.
Stack depth:
The Linux kernel has a limited available stack depth. Calling
internal file system functions from unexpected code paths or, worse
yet, recursively, can easily result in exceeding the kernel stack
limit. Some unioning file systems are implemented as stacked or
layered file systems, which inherently add to stack depth.
Number of branches:
Memory usage is often proportional to the number of branches in use.
Branches may be limited to a compile-time maximum, but allocating
enough memory for the maximum is prohibitive. Dynamic allocation and
reallocation as branches are added and removed can be complex and
introduce new opportunities for failure.
Coherency of changes:
A common feature request is to allow modification of more than one
branch at once. This requires some method of cache coherence between
the various parts of the file system. Usually this method is absent
or only partially correct. Users often request direct access to the
file systems making up the branches of the union (instead of access
through the unioning file system), a situation particularly difficult
to deal with.
Dynamic branch management:
Users often would like to add, remove, or change the policies of
branches in a union while the union is still mounted. In trivial
cases, this is a simple matter of parsing mount options and
manipulating data structures, but may have major effects on any cache
coherency implementation.
readdir() and friends:
One of the great tragedies of the UNIX file system interface is the
enshrinement of readdir(), telldir(), seekdir(), etc. family in the
POSIX standard. An application may begin reading directory entries
and pause at any time, restarting later from the "same" place in the
directory. The kernel must give out 32-bit magic values which allow
it to restart the readdir() from the point where it last stopped.
Originally, this was implemented the same way as positions in a file:
the directory entries were stored sequentially in a file and the
number returned was the offset of the next directory entry from the
beginning of the directory. Newer file systems use more complex
schemes and the value returned is no longer a simple offset. To
support readdir(), a unioning file system must merge the entries from
lower file systems, remove duplicates and whiteouts, and create some
sort of stable mapping that allows it to resume readdir() correctly.
Support from userspace libraries can make this easier by caching the
results in user memory.
Stable, unique inode numbers:
NFS exports and some applications expect inode numbers of files not to
change between mounts, reboots, etc. Applications also expect that
files can be uniquely identified by a combination of the device id and
the inode number. A unioning file system can't just copy up the inode
number from the underlying file system since the same inode is very
likely to be used on more than one file system. Unique (but not
stable) inode numbers can be implemented fairly trivially, but stable
inode numbers, require some sort of persistent state mapping files from
the underlying file system to assigned inode numbers.
mmap():
mmap() is always the hard part. A good way to sound smart without
knowing anything about file systems is to nod sagely and say, "What
about mmap()?" mmap() on a file in a union file system is hard
because the file may suddenly disappear and be replaced with another
(copy up of a file on write from another process) or can be changed
without the cooperation of the unioning file system (direct writes to
the file systems backing branches). Some unioning file systems permit
forms of mmap() which are not correctly implemented according to the
POSIX standard.
rename() between branches:
rename() is a convenient system call that allows atomic file renames
on the same file system. Some unioning file systems try to emulate
rename() across different branches. Others just return EXDEV, the
error code for an attempted rename() across different file systems.
Most applications can cope with the failure of rename() and fall back
to a normal file copy.
File system implementation differences:
Different file systems have different rules about permitted file
names, allowed characters, name length, case sensitivity, extended
attribute names and sizes, etc. The unioning file system wants to
translate from one file system's rules to another. Some union file
systems just restrict the types of file systems they support rather
than implement complex translation code for uncommon use cases.
Multiple hard links:
A file on a lower branch may have multiple hard links; that is,
multiple paths point to the same inode. When a file with multiple
hard links on a read-only branch is altered, strict UNIX semantics
require finding all the other paths to that file and copying them to
the writable branch as well. Unfortunately, UNIX file systems
generally don't provide an efficient way to find all the paths to an
inode. Some union file systems keep a list of inodes with
undiscovered alternate paths and copy them over when a new path is
accessed, others just ignore the problem. It's an open question as to
how many applications depend on the correct behavior of hard links
when used in this manner.
Permissions, owner, and timestamps:
These attributes are often calculated using using a combination of
underlying file system permissions, options specified on mount, the
user and umask at time of mount, and branch management policies.
Exact policies vary wildly.
Feature creep:
Sometimes, the features provided by unioning file systems go beyond
the actual needs of 99.9% of unioning use cases. This may be the
result of fanciful user requests, or a "just because we can" approach
- users only ask for two or three branches, but it's possible to support
thousands of branches, so let's do it, or the code is structured such
that all combinations of X, Y, and Z features are trivial to
implement, even though users only want X and Y or Y and Z. Each
feature may seem nearly free, but often ends up constraining the
implementation or providing new opportunities for bugs. Focus is
important.
Summary
Union file systems are inherently difficult to implement for many
reasons, but much of the complexity and ugliness come from solving the
following problems: whiteouts,
readdir() support, stable inode
numbers, and concurrent modifications to more than one branch at a
time. Few of these problems have obviously superior solutions, only
solutions with different tradeoffs.
Coming next
In the next article, we will review BSD union mounts, unionfs, aufs,
and Linux union mounts. For each of these unioning file systems,
we'll look at specific implementation details, including kernel data
structures and their solutions to the various union file system
problems described in this article. We'll also review the issues
surrounding merging each solution into the mainline Linux kernel.
We'll wrap up with general analysis of what does and doesn't work for
unioning file systems, from both a purely technical and a software
engineering point of view.
To be continued...
Acknowledgements
Many discussions with developers helped with the background of this
article, but this article in particular benefited from discussions
with Erez Zadok, Christoph Hellwig, and Jan Blunck.
Comments (41 posted)
By Jake Edge
March 13, 2009
There are quite a variety of tracing options
for Linux, with SystemTap being the most prominent, but that particular
solution has yet to become easily usable, at least partly due to its many
dependencies on user-space configuration and tools. Another choice, which
originally came from work on
realtime Linux, is ftrace. Ftrace is a self-contained solution, requiring
no user-space tools or support, that is
useful for tracking down problems—not only in the kernel, but in its
interactions with user space as well.
The name ftrace comes from "function tracer", which was its original
purpose, but it can do more than that. Various additional tracers have
been added to look at things like context switches, how long interrupts are
disabled, how long it takes for high-priority tasks to run after they have
been woken up, and so on. Its genesis in the realtime tree is evident in
the tracers so far available, but ftrace also includes a plugin framework
that allows new tracers to be added easily.
In a suitably configured kernel—one with CONFIG_FUNCTION_TRACER
enabled—accessing ftrace is done through the debug filesystem.
Typically, it is mounted this way:
# mount -t debugfs nodev /debug
That creates a
/debug/tracing subdirectory which is used to
control ftrace and for getting output from the tool.
One chooses a particular tracer to use by writing to
/debug/tracing/current_tracer—possibly after querying the
tracers
available by reading /debug/tracing/available_tracers.
On a recent Fedora rawhide kernel, checking the tracers available results
in:
# cat /debug/tracing/available_tracers
power wakeup irqsoff function sysprof sched_switch initcall nop
Tracing is
then enabled or disabled by writing to
/debug/tracing/tracing_enabled. Using the function tracer would
be done as follows:
# echo function >/debug/tracing/current_tracer
# echo 1 >/debug/tracing/tracing_enabled
...some commands or activity to trace...
# echo 0 >/debug/tracing/tracing_enabled
This produces a trace of each kernel function called, along with a
timestamp, which will allow a kernel hacker to see what is going on inside
the kernel.
Output from ftrace can be read from one of several files in the
tracing directory:
- trace - contains human-readable output of the trace
- latency_trace - output from the same trace, but organized so
that system latency issues can be diagnosed, also in human-readable format
- trace_pipe - contains the same output as trace, but
is meant to be piped into a command. Unlike the other two, reading from
this file consumes the output.
The ring buffer used by ftrace is a fixed size (governed by the
buffer_size_kb file), so
trace or
latency_trace
entries will eventually be overwritten.
Here are examples of the output from a function trace. The header helps to
decode the various fields in the trace. This is what the output from
trace looks like (heavily edited for brevity's sake):
# tracer: function
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
bash-3330 [000] 147.799029: sys_open <-syscall_call
bash-3330 [000] 147.799030: do_sys_open <-sys_open
bash-3330 [000] 147.799030: getname <-do_sys_open
bash-3330 [000] 147.799031: kmem_cache_alloc <-getname
While this is
latency_trace output (similarly edited):
# tracer: function
#
function latency trace v1.1.5 on 2.6.29-0.215.rc7.fc11.i586
--------------------------------------------------------------------
latency: 0 us, #120119/5425477, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:2)
-----------------
| task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
-----------------
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| /
# ||||| delay
# cmd pid ||||| time | caller
# \ / ||||| \ | /
bash-3330 0.... 3531221us : sys_open (syscall_call)
bash-3330 0.... 3531222us : do_sys_open (sys_open)
bash-3330 0.... 3531222us : getname (do_sys_open)
bash-3330 0.... 3531223us : kmem_cache_alloc (getname)
Each line in the two output formats shows one function call, with one
level of backtrace, along with the process name, process id, and which CPU
the call
was made on. The
latency_trace output also provides information
on whether interrupts have been disabled, whether a reschedule has been
called for, whether its running in an interrupt context, and whether
preemption has been disabled. The timestamp for the
latency_trace
output is relative to the start of the trace in microseconds; the
space after the time, and before the colon, is a field that gets set to either '!' or '+' to call
attention to especially long delays (in the example output it is always
blank). Unfortunately, the header is a little misleading for where the "delay"
and "caller" fields point.
The sysctl kernel.ftrace_enabled governs whether function entry is
recorded as part of the trace. It can be turned
on by using
either of the following commands:
# sysctl kernel.ftrace_enabled=1
or
# echo 1 >/proc/sys/kernel/ftrace_enabled
Without that, many of the tracers are essentially pointless.
Half-a-dozen different tracers are described in the voluminous Documentation/ftrace.txt that comes
in the kernel source tree. In addition to the function tracer, there is
the sched_switch tracer that tracks and reports on context
switches in the
kernel, showing process wakeups along with when they get scheduled. Each
trace entry has a timestamp along with the priorities and states of the
affected processes.
The nop tracer is not a tracer at all, but can be used to clear
whatever tracer is currently active.
There are another seven tracers in the mainline
that have yet to make it into the documentation. In addition, there are
even more tracers that are targeted for the 2.6.30 kernel.
There are four tracers that look for the
maximum time spent in a particular state, recording that maximum time
(in trace_max_latency) along with a trace of the functions called
during that state. irqsoff looks for the longest time spent with
interrupts
disabled, while preemptoff looks for the maximum time spent with
preemption turned off. Combining the two gives the preemptirqsoff
tracer. The final tracer, wakeup looks for the maximum latency
for the highest priority process to get scheduled after it has been woken up.
Each of those helps by reducing the amount of trace data that a kernel
hacker needs to wade through. For kernels that are configured with
CONFIG_DYNAMIC_FTRACE, there is another way to filter the trace
data. Dynamic ftrace allows the user to select kernel functions that are
either included or excluded from the tracing information collected. The
available_filter_functions file lists the functions that can be
filtered, while writing function names to set_ftrace_filter or
set_ftrace_nofilter will include or exclude those functions.
Those lists can be appended to by using the standard >>
shell redirection operator. In addition, there is support for wildcards so
that:
echo 'hrtimer_*' >/debug/tracing/set_ftrace_filter
will only gather tracing information from high-resolution timer functions.
Other tracers available in the mainline kernel (i.e. what will become
2.6.29) include:
- power - traces the CPU power state transitions
- function_graph - similar to the function tracer, but traces
both function entry and exit, which allows a call graph to be generated
- sysprof - periodically (governed by
sysprof_sample_period) generate stack traces for the running
process or kernel thread
- initcall - traces the entry and exit of initialization
function calls during
system boot
- branch - traces branch prediction and execution
- hw-branch-tracer - uses the Branch Target Stack (BTS)
feature of x86 CPUs to trace branch execution
- mmiotrace - traces memory-mapped I/O
The production of human-readable output directly from the kernel recently
led to some hard questions from Andrew Morton:
"Why on earth do we keep on putting all these pretty-printers and
pretty-parsers into the kernel? I mean, how hard is it for userspace
to read a text file, do some basic substitutions and print it out
again?" But, others argue that is precisely because of the ease of
getting a readable trace directly from the kernel which makes ftrace so
useful.
Morton's argument is that the
fixed format is more difficult for
scripts to parse and that the output is English-only. He contends that
some kind of API would simplify things. Pointing to the
latency_trace output as an example, he said:
[...] now you need to think about how this interface would have been
designed if we'd decided to access it with something smarter than
`cat'.
I mean, look at it. All the multi-space column lining upping, the
unnecessary "us" annotation, the strange symbol(symbol) thing, etc.
Plus it would have been more self-describing. Right now, your parser
has to either assume that the second character of "0d..1" is
"irqs-off", or it has to learn how to follow ascii art lines.
Rather than needing some kind of user-space tool to interpret the ftrace
output, a kernel hacker can quickly get what they need from the kernel
itself. As Ingo Molnar pointed out, there
may be no build environment available on the target machine, so having the
trace output available directly is useful: "Self-sufficient kernel
instrumentation is a key concept to usability". Further, he said,
the output has been designed to be scriptable, but if that is not
sufficient, there are options to produce raw output. Overall, he sees
kernel pretty-printing as a needed feature:
IMO pretty-printing in the kernel should not be made a religious
argument but a technical argument. Pretty-printing is a tool,
and as a tool it sometimes can be used for silly purposes, and
sometimes it can be used for helpful purposes. You seem to argue
that doing it in the kernel is always silly - and i strongly
disagree with that and consider it an extreme-end position - for
the reasons outlined above.
The options that Molnar refers to are accessed via the
trace_options file, which lists the current options settings when read,
or it can be written to change options. Three separate options: raw,
hex, and bin (along with several to determine which fields are included)
control the format of the output. These can produce output in formats that
may be easier for some tools to use. Not requiring any tool to get
a useful
trace, though, is seen as a very important part of ftrace, at least by some.
There are lots more options and features of ftrace than are covered here.
The aforementioned ftrace.txt document is a good place to start
for those interested in more details.
Beyond those tracers already in the mainline, there are another handful
that are being readied for 2.6.30. This includes a new event tracer, which
allow tracing various scheduling, locking, and interrupt handling events.
Continuous statistics gathering for workqueues and branch prediction is
included as well.
Ftrace is a useful tool that can provide excellent diagnostic information
from within a running kernel. It is not a general-purpose tool, nor is it
geared to people unfamiliar with the innards of the kernel, but it
certainly has its uses. There is quite a bit of activity going on with
ftrace these days, with numerous patches and enhancements floating around
on linux-kernel. While it may not reduce Dtrace envy directly, it is a
tool that many kernel hackers are using today.
Comments (12 posted)
By Jonathan Corbet
March 18, 2009
We are very much creatures of tradition here at LWN. So, as the 2.6.29
development cycle nears its end, tradition drives us to take a look at
where the code came from in this development cycle.
As of March 17, 11,610 non-merge changesets have been folded into the
2.6.29 kernel. These patches added an amazing 1,228,000 lines of code and
documentation, while removing 401,000; the 2.6.29 kernel will have 827,000
more lines than its predecessor. Some 1420 1166 developers took part in this
cycle. Your editor, sensing that this number represents a record, decided
to look back at previous kernels:
| Release | Developers |
| 2.6.22 | 885 |
| 2.6.23 | 854 |
| 2.6.24 | 950 |
| 2.6.25 | 1124 |
| 2.6.26 | 1027 |
| 2.6.27 | 1022 |
| 2.6.28 | 1078 |
| 2.6.29 | 1166 |
It would seem that there is a clear trend here: the kernel development
community has grown
significantly over the last couple of years. The number of employers
represented by these developers (175) has grown a little, but the
uncertainties involved in associating developers with employers argue
against reading too much into that particular number. Suffice to say that
quite a few companies are supporting kernel development work.
Here are the individual developer statistics:
| Most active 2.6.29 developers |
| By changesets |
| Chris Mason | 671 | 5.8% |
| Takashi Iwai | 173 | 1.5% |
| Jaswinder Singh Rajput | 158 | 1.4% |
| Stephen Hemminger | 154 | 1.3% |
| Mike Frysinger | 150 | 1.3% |
| Christoph Hellwig | 143 | 1.2% |
| Ben Dooks | 142 | 1.2% |
| Alexey Dobriyan | 138 | 1.2% |
| Ingo Molnar | 133 | 1.1% |
| Rusty Russell | 127 | 1.1% |
| Steven Rostedt | 110 | 0.9% |
| Mauro Carvalho Chehab | 109 | 0.9% |
| Mark Brown | 108 | 0.9% |
| Sam Ravnborg | 108 | 0.9% |
| David S. Miller | 107 | 0.9% |
| Greg Kroah-Hartman | 105 | 0.9% |
| Harvey Harrison | 101 | 0.9% |
| David Howells | 100 | 0.9% |
| Russell King | 93 | 0.8% |
| Paul Mundt | 87 | 0.7% |
|
| By changed lines |
| Greg Kroah-Hartman | 280883 | 20.8% |
| Luis R. Rodriguez | 71604 | 5.3% |
| Chris Mason | 69935 | 5.2% |
| Daniel Krueger | 56534 | 4.2% |
| David Kiliani | 41371 | 3.1% |
| David Daney | 18767 | 1.4% |
| Solomon Peachy | 17386 | 1.3% |
| Robert Love | 15262 | 1.1% |
| Sujith | 14703 | 1.1% |
| Inaky Perez-Gonzalez | 14388 | 1.1% |
| David S. Miller | 13422 | 1.0% |
| Jesse Barnes | 13036 | 1.0% |
| Christoph Hellwig | 12548 | 0.9% |
| Michael Hennerich | 12334 | 0.9% |
| Subbu Seetharaman | 12285 | 0.9% |
| Jaswinder Singh Rajput | 11651 | 0.9% |
| Rusty Russell | 10878 | 0.8% |
| Ben Dooks | 10809 | 0.8% |
| David Schleef | 10325 | 0.8% |
| Mark Brown | 9945 | 0.7% |
|
Chris Mason comes out on top of the "changesets" category as a result of
the Btrfs merge. It is a significant body of code, to be sure, but the
changeset count is as high as it is because the entire Btrfs development
history was merged. So we're seeing rather more than three months worth of
work there. Takasi Iwai did a great deal of work in the ALSA subsystem,
and in the Intel HDA driver in particular. Jaswinder Singh Rajput
contributed quite a few patches of the cleanup variety. Stephen
Hemminger's work consisted mainly of changing the network driver API, then
fixing a long list of broken drivers. And Michael Frysinger contributed a
lot of changes to the Blackfin architecture.
If one looks at the number of lines changed, the picture is a little
different. As with 2.6.28, Greg Kroah-Hartman fed large amounts of crap
(his word) into the -staging tree; that code does not retain the original
author information within the git repository (though, of course, credits in
the code itself are unchanged). Luis Rodriguez did a lot of work on
Atheros wireless drivers and the cfg80211 subsystem; much of this work was
associated with regulatory
compliance support. Daniel Krueger achieved his place on the list by
contributing a single patch: the Systec Electronic openPOWERLINK network
stack. David Kiliani is another one-patch wonder; his was a driver for
Meilhaus ME-IDS data collection cards. Daniel and David's patches both
went into the -staging tree. So, three of the top five code contributors
put their work in by way of -staging.
The associated employer statistics look like this:
| Most active 2.6.29 employers |
| By changesets |
| (None) | 1612 | 13.9% |
| (Unknown) | 1378 | 11.9% |
| Red Hat | 1229 | 10.6% |
| Oracle | 992 | 8.5% |
| IBM | 749 | 6.5% |
| Intel | 686 | 5.9% |
| Novell | 632 | 5.4% |
| (Consultant) | 370 | 3.2% |
| Analog Devices | 282 | 2.4% |
| Fujitsu | 212 | 1.8% |
| (Academia) | 204 | 1.8% |
| Renesas Technology | 165 | 1.4% |
| Nokia | 163 | 1.4% |
| Vyatta | 154 | 1.3% |
| Parallels | 149 | 1.3% |
| Simtec | 138 | 1.2% |
| Atheros Communications | 131 | 1.1% |
| AMD | 130 | 1.1% |
| Wolfson Microelectronics | 104 | 0.9% |
| SGI | 100 | 0.9% |
|
| By lines changed |
| Novell | 306183 | 22.7% |
| (Unknown) | 197224 | 14.6% |
| Atheros Communications | 96202 | 7.1% |
| Oracle | 93846 | 7.0% |
| (None) | 92811 | 6.9% |
| Red Hat | 77087 | 5.7% |
| Intel | 62265 | 4.6% |
| SYS TEC electronic GmbH | 56534 | 4.2% |
| Analog Devices | 44659 | 3.3% |
| IBM | 40560 | 3.0% |
| (Consultant) | 28983 | 2.1% |
| Cavium Networks | 18767 | 1.4% |
| Renesas Technology | 16946 | 1.3% |
| Nokia | 11951 | 0.9% |
| Simtec | 10886 | 0.8% |
| Broadcom | 10314 | 0.8% |
| Wolfson Microelectronics | 10147 | 0.8% |
| Freescale | 8520 | 0.6% |
| Chelsio | 7738 | 0.6% |
| QLogic | 7322 | 0.5% |
|
The employer numbers tend not to change radically from one release to the
next; many of the same companies show up every time. New appearances this
time include Vyatta (which supports Stephen Hemminger's work) and some
companies (Simtec, SYS TEC, Cavium Networks) which contributed
support for their own products.
The number of patches with Reviewed-by tags remains relatively small - less
than 5% of the total. The top credited reviewers this time around are:
| Developers with the most reviews |
| James Morris | 64 | 20.2% |
| Dave Chinner | 51 | 16.1% |
| Christoph Hellwig | 39 | 12.3% |
| Andrew Morton | 14 | 4.4% |
| Eric Sandeen | 12 | 3.8% |
| Daisuke Nishimura | 11 | 3.5% |
| KOSAKI Motohiro | 10 | 3.2% |
| Matthew Wilcox | 8 | 2.5% |
| WANG Cong | 7 | 2.2% |
| Zhu, Yi | 5 | 1.6% |
| KAMEZAWA Hiroyuki | 5 | 1.6% |
| Eric Anholt | 4 | 1.3% |
| Pekka Enberg | 4 | 1.3% |
| Tomas Winkler | 4 | 1.3% |
| Paul Menage | 4 | 1.3% |
| Mike Christie | 4 | 1.3% |
| Grant Grundler | 4 | 1.3% |
These numbers remain an artifact of how the reporting of reviews is done;
certainly there is more code review than this going on. The same is true
for reporting and testing credits. For 2.6.29, the numbers are:
| Most credited 2.6.29 reporters and testers |
| Reported-by credits |
| Randy Dunlap | 13 | 3.8% |
| Ingo Molnar | 7 | 2.0% |
| Li Zefan | 6 | 1.7% |
| Alexander Beregalov | 5 | 1.5% |
| Stephen Rothwell | 5 | 1.5% |
| Stefan Richter | 4 | 1.2% |
| Johannes Berg | 4 | 1.2% |
| Eric Sesterhenn | 4 | 1.2% |
| Kamalesh Babulal | 4 | 1.2% |
| Larry Finger | 3 | 0.9% |
| Linus Torvalds | 3 | 0.9% |
| Andrew Morton | 3 | 0.9% |
| Guennadi Liakhovetski | 3 | 0.9% |
| Huang Ying | 3 | 0.9% |
| Daisuke Nishimura | 3 | 0.9% |
| Meelis Roos | 3 | 0.9% |
| Geert Uytterhoeven | 3 | 0.9% |
|
| Tested-by: credits |
| Hin-Tak Leung | 14 | 5.2% |
| Mike Frysinger | 7 | 2.6% |
| Larry Finger | 7 | 2.6% |
| Ingo Molnar | 6 | 2.2% |
| Herton Ronaldo Krzesinski | 6 | 2.2% |
| Li Zefan | 5 | 1.9% |
| Daisuke Nishimura | 4 | 1.5% |
| KAMEZAWA Hiroyuki | 4 | 1.5% |
| Andrew Patterson | 4 | 1.5% |
| Meelis Roos | 4 | 1.5% |
| KOSAKI Motohiro | 3 | 1.1% |
| Stephen Gildea | 3 | 1.1% |
| Robert Jarzmik | 3 | 1.1% |
| Serge Hallyn | 3 | 1.1% |
| Eric Sesterhenn | 3 | 1.1% |
|
All told, 2.6.29 was one of the most active development cycles yet, with
vast amounts of code finding its way into the kernel and a record number of
developers participating. The development community might be justified in
taking a rest after this much work, but the kernel process, it seems, never
stops. There is already a lot of work waiting for the 2.6.30 merge window
to open, at which point the whole cycle will start anew.
(Thanks, as always, to Greg Kroah-Hartman for his help in assembling these
statistics.)
Comments (3 posted)
March 17, 2009
This article was contributed by Paul McKenney
Introduction
Read-copy update (RCU) is a synchronization mechanism that was added to
the Linux kernel in October of 2002.
RCU improves scalability
by allowing readers to execute concurrently with writers.
In contrast, conventional locking primitives require that readers
wait for ongoing writers and vice versa.
RCU ensures read coherence by
maintaining multiple versions of data structures and ensuring that they are not
freed up until all pre-existing read-side critical sections complete.
RCU relies on efficient and scalable mechanisms for publishing
and reading new versions of an object, and also for deferring the collection
of old versions.
These mechanisms distribute the work among read and
update paths in such a way as to make read paths extremely fast. In some
cases (non-preemptable kernels), RCU's read-side primitives have zero
overhead.
RCU updates can be expensive, so RCU is in general best-suited to
read-mostly workloads.
Although Classic RCU's memory footprint has been acceptable,
hierarchical RCU
has a memory footprint that is considerably larger.
I was surprised to find that this additional memory consumption
did not greatly concern the embedded Linux people I talked to,
but then again,
I certainly do not know everyone in the embedded Linux community,
and the complexity of both
hierarchical RCU,
preemptable RCU,
and
the dynticks interface
made the thought of a near-minimal kernel-capable RCU with a classic RCU
API more interesting.
In addition, a recent
show of hands at linux.conf.au
indicated that there is still interest in running Linux on small-memory
single-CPU systems.
Discussions with Josh Triplett identified an attractive optimization
for the uniprocessor case, and so
“RCU: The Bloatwatch Edition” was born.
Contents:
-
Review of RCU Fundamentals
- Design of Tiny RCU
- Tiny-RCU Code Walkthrough
- Testing
- Memory Footprint
These sections are followed by
concluding remarks and the
answers to the Quick Quizzes.
This section is quite similar to its counterpart in the
description of
hierarchical RCU.
People familiar with RCU semantics may wish to proceed directly to the
next section.
In its most basic form, RCU is a way of waiting for things to finish.
Of course, there are a great many other ways of waiting for things to
finish, including reference counts, reader-writer locks, hashed locks,
events, and so on.
The great advantage of RCU is that it can wait for each of
(say) 20,000 different things without having to explicitly
track each and every one of them, and without having to worry about
the performance degradation, scalability limitations, complex deadlock
scenarios, and memory-leak hazards that are inherent in schemes
using explicit tracking.
In RCU's case, the things waited on are called
"RCU read-side critical sections".
An RCU read-side critical section starts with an
rcu_read_lock() primitive, and ends with a corresponding
rcu_read_unlock() primitive.
RCU read-side critical sections can be nested, and may contain pretty
much any code, as long as that code does not explicitly block or sleep.
If you abide by these conventions, you can use RCU to wait for pretty
much any desired piece of code to complete.
Quick Quiz 1:
But what if I need my RCU read-side critical section to sleep and block?
RCU accomplishes this feat by indirectly determining when these
other things have finished, as has been described elsewhere for
Classic RCU and
realtime RCU.
In particular, as shown in the following figure, RCU is a way of
waiting for pre-existing RCU read-side critical sections to completely
finish, including memory operations executed by those critical sections.
However, note that RCU read-side critical sections
that begin after the beginning
of a given grace period can and will extend beyond the end of that grace
period.
The following section gives a very high-level view of how
the Tiny RCU implementation operates.
The key restriction that enables a smaller and simpler RCU
implementation is CONFIG_SMP=n, which means that
any time the sole CPU passes through a quiescent state, a
grace period has elapsed.
In principle, the scheduler could simply invoke all pending RCU callbacks
on each context switch, but in practice it is wise to maintain a degree
of isolation between the scheduler and RCU.
An arms-length relationship allows RCU callbacks to invoke appropriate
scheduler functions (e.g., wake_up()) without potential
deadlock issues.
Therefore, the RCU core runs in softirq context.
This uniprocessor approach also simplifies the data structure, so
that each flavor of RCU (rcu_ctrlblk and
rcu_bh_ctrlblk) has the following structure:
1 struct rcu_ctrlblk {
2 long completed;
3 struct rcu_head *rcucblist;
4 struct rcu_head **donetail;
5 struct rcu_head **curtail;
6 };
The ->completed field is required only for the
rcu_batches_completed() and
rcu_batches_completed_bh() interfaces used by the
RCU torture tests.
The ->rcucblist field is the header for the list
of RCU callbacks (rcu_head structures), the
->donetail field references the next
pointer of the last rcu_head structure in the list
whose grace period has completed, and the ->curtail
field references the next pointer of the last
rcu_head structure in the list.
The following figure shows a sample callback list that has two callbacks
ready to invoke and a third callback still waiting for a grace
period (or, equivalently on a uniprocessor system, for a quiescent
state):
Quick Quiz 2:
But we should be able to simply execute all callbacks at each
quiescent state!
So why bother with separate ->donetail and
->curtail sublists?
A block diagram of tiny RCU appears to the right; see this page for a full-size version.
The blue rectangles in this diagram represent functions making up
tiny RCU, the blue rectangles with rounded corners represent
data structures within tiny RCU, and the red rectangles represent
other parts of the Linux kernel.
Solid arrows represent function-call interfaces, while dashed arrows
represent indirect invocation.
The RCU read-side and list-manipulation primitives are not shown,
nor are the interfaces specific to the "rcutorture" test module.
Like classic and hierarchical RCU, tiny RCU's grace-period detection
is driven by context switches, the scheduling-clock interrupt
(scheduler_tick()), and the idle loop.
The scheduling-clock interrupt handler and idle loops contain code as follows:
if (rcu_pending(cpu))
rcu_check_callbacks(cpu, 0);
So if rcu_pending() indicates that the RCU core has
any work to do,
rcu_check_callbacks() is invoked, which in turn checks to see if
the CPU is currently in a quiescent state, invoking either or both
of rcu_qsctr_inc() and rcu_bh_qsctr_inc()
as appropriate.
These in turn invoke rcu_qsctr_help(), which, if there are
RCU callbacks present, updates the callback lists to indicate that
their grace period has elapsed and returns 1 to tell the caller to
invoke raise_softirq().
At some later time, rcu_process_callbacks() will be invoked
from softirq context, which, via __rcu_process_callbacks(),
invokes all callbacks in rcu_ctrlblk and
rcu_bh_ctrlblk.
Tiny RCU also interfaces to dynticks, albeit in a slightly different way than
do the classic, preemptable, and hierarchical RCU implementations.
Because there is but a single CPU, entering dynticks-idle mode is both
a quiescent state and a grace period, allowing rcu_qsctr_inc() to
directly invoke raise_softirq(). This direct invocation in turn
means that rcu_needs_cpu() can simply return zero, since any
callbacks are processed on the way into dynticks-idle state.
The call_rcu() and call_rcu_bh()
interfaces queue callbacks
via the __call_rcu() helper function.
The synchronize_rcu() interface is a special case that will be
described in the code walkthrough in the next section.
Interestingly enough, the general shape of the above block diagram
applies to all RCU implementations.
Of course, the processing is more complex in the CONFIG_SMP
case, particularly in the __rcu_process_callbacks() area.
This code walkthrough starts with tiny RCU's update-side API, then discusses
the grace-period machinery, and finally the dynticks interface.
Update-Side API
The code for synchronize_rcu() is as follows:
1 void synchronize_rcu(void)
2 {
3 unsigned long flags;
4
5 local_irq_save(flags);
6 rcu_ctrlblk.completed++;
7 local_irq_restore(flags);
8 }
This code merely increments the ->completed field
with interrupts disabled.
This works because synchronize_rcu() may only be invoked
when it is legal to block, in other words, synchronize_rcu()
cannot be called from within an RCU read-side critical section.
Therefore, anywhere synchronize_rcu() may legally
be invoked is guaranteed to be a quiescent state.
Because quiescent states are also grace periods on uniprocessor systems,
as noted by Josh Triplett,
any call to synchronize_rcu() is automatically a grace
period.
Quick Quiz 3:
If all calls to synchronize_rcu() are automatically
grace periods, why isn't synchronize_rcu() simply
an empty function?
Quick Quiz 4:
If all calls to synchronize_rcu() are automatically
grace periods, why doesn't synchronize_rcu() process
any pending RCU callbacks?
The following shows the code for the call_rcu() group
of functions:
1 static void __call_rcu(struct rcu_head *head,
2 void (*func)(struct rcu_head *rcu),
3 struct rcu_ctrlblk *rcp)
4 {
5 unsigned long flags;
6
7 head->func = func;
8 head->next = NULL;
9 local_irq_save(flags);
10 *rcp->curtail = head;
11 rcp->curtail = &head->next;
12 local_irq_restore(flags);
13 }
14
15 void call_rcu(struct rcu_head *head,
16 void (*func)(struct rcu_head *rcu))
17 {
18 __call_rcu(head, func, &rcu_ctrlblk);
19 }
20
21 void call_rcu_bh(struct rcu_head *head,
22 void (*func)(struct rcu_head *rcu))
23 {
24 __call_rcu(head, func, &rcu_bh_ctrlblk);
25 }
Lines 1-13 show the code for __call_rcu(), which is a
common-code helping function.
Lines 7-8 initialize the specified rcu_head
structure.
Line 9 disables interrupts (and line 12 restores them),
so that lines 10-11 can enqueue the callback undisturbed by
interrupt handlers that might also invoke call_rcu().
Lines 15-19 and lines 21-25 are simple wrappers
implementing call_rcu() (which enqueues the callback
to rcu_ctrlblk) and call_rcu_bh()
(which enqueues the callback to rcu_bh_ctrlblk),
respectively.
The callbacks are enqueued to the last segment of the queue, in other
words, to the portion still waiting for a grace period to end.
Grace-Period Machinery
The lowest-level grace-period machinery is supplied by
the rcu_qsctr_inc() family of interfaces that
report passage through a quiescent state.
These functions are implemented as follows:
1 static int rcu_qsctr_help(struct rcu_ctrlblk *rcp)
2 {
3 if (rcp->rcucblist != NULL &&
4 rcp->donetail != rcp->curtail) {
5 rcp->donetail = rcp->curtail;
6 return 1;
7 }
8 return 0;
9 }
10
11 void rcu_qsctr_inc(int cpu)
12 {
13 if (rcu_qsctr_help(&rcu_ctrlblk) ||
14 rcu_qsctr_help(&rcu_bh_ctrlblk))
15 raise_softirq(RCU_SOFTIRQ);
16 }
17
18 void rcu_bh_qsctr_inc(int cpu)
19 {
20 if (rcu_qsctr_help(&rcu_bh_ctrlblk))
21 raise_softirq(RCU_SOFTIRQ);
22 }
Lines 1-9 show rcu_qsctr_help(), which is a
common-code helper function for rcu_qsctr_inc()
(lines 11-16) and rcu_bh_qsctr_inc()
(lines 18-22), which in turn report ``rcu'' and ``rcu_bh''
quiescent states, respectively.
Within rcu_qsctr_help,
lines 3-4 check to see if there are any RCU callbacks
within the specified rcu_ctrlblk structure that are still
waiting for their grace period to complete, and, if so,
line 5 updates the pointers so as to mark them as being ready to
invoke, and line 6 returns non-zero in order to tell the caller
to cause the callback-processing code to start running.
Otherwise, line 8 returns zero to tell the caller that it is not necessary
to process callbacks.
The rcu_qsctr_inc() function
invokes rcu_qsctr_help()
twice, once for rcu and once again for rcu_bh, and, if either indicates
that callback processing is required, it also invokes
raise_softirq(RCU_SOFTIRQ) to cause callback processing
to be performed.
Quick Quiz 5:
Why not simply directly call rcu_process_callbacks()?
The rcu_bh_qsctr_inc() function
invokes rcu_qsctr_help(), but only for rcu_bh.
If the return value indicates that callback processing is required,
raise_softirq(RCU_SOFTIRQ) is invoked to make this happen.
Dynticks Interface
The tinyrcu dynticks interface is refreshingly simple, because uniprocessor
systems need not worry about the dynticks-idle state of the nonexistent
other CPUs.
The code is as follows:
1 void rcu_enter_nohz(void)
2 {
3 if (--dynticks_nesting == 0)
4 rcu_qsctr_inc(0);
5 }
6
7 void rcu_exit_nohz(void)
8 {
9 dynticks_nesting++;
10 }
11
12 void rcu_irq_enter(void)
13 {
14 rcu_exit_nohz();
15 }
16
17 void rcu_irq_exit(void)
18 {
19 rcu_enter_nohz();
20 }
21
22 void rcu_nmi_enter(void)
23 {
24 }
25
26 void rcu_nmi_exit(void)
27 {
28 }
The rcu_enter_nohz() function is shown on
lines 1-5.
Line 3 decrements the dynticks_nesting global variable,
and, if the result is zero, the CPU has entered dynticks-idle
mode, which is a quiescent state.
(The dynticks_nesting global variable is initialized to 1.)
Quick Quiz 6:
Why is only rcu_qsctr_inc() invoked?
What about rcu_bh_qsctr_inc()?
The rcu_exit_nohz() function is shown on lines 7-10.
This function simply increments the dynticks_nesting global
variable.
As can be seen on lines 12-20, the rcu_irq_enter()
and rcu_irq_exit() functions are simple wrappers around
rcu_exit_nohz() and rcu_enter_nohz(),
respectively.
Please note that entering an interrupt handler exits dynticks-idle
mode from an RCU viewpoint, and vice versa.
This often-confusing relationship is maintained in the names of these
functions.
Lines 22-28 show rcu_nmi_enter() and
rcu_nmi_exit(), both of which are empty functions.
Because we are running on a uniprocessor, we can safely ignore
NMI handlers.
The reason this is safe is that there cannot be any quiescent states
on this CPU within an NMI handler, and there are no other CPUs to
execute concurrent quiescent states.
Finally, the rcu_needs_cpu function (not shown)
simply returns non-zero,
indicating that RCU is always prepared for a given CPU to go into
dynticks-idle mode.
Running the script
included in the
hierarchical RCU article
shows that the CONFIG_NO_HZ kernel config parameter
is important, and further
investigation identifies the CONFIG_PREEMPT parameter
as well.
This results in a nice small set of testing scenarios:
-
!CONFIG_PREEMPT and !CONFIG_NO_HZ
-
!CONFIG_PREEMPT and CONFIG_NO_HZ
-
CONFIG_PREEMPT and !CONFIG_NO_HZ
-
CONFIG_PREEMPT and CONFIG_NO_HZ
This is a welcome contrast to the situation for hierarchical RCU.
Simpler code means smaller memory footprint, as can be seen in
the following table:
| Memory use | |
| Implementation |
text |
data |
bss | total |
filename |
| Classic RCU |
363 |
12 |
24 |
399 |
kernel/rcupdate.o |
| 1237 |
64 |
124 |
1425 |
kernel/rcuclassic.o |
|
|
|
1824 |
Classic RCU Total |
| |
| Hierarchical RCU |
363 |
12 |
24 |
399 |
kernel/rcupdate.o |
| 2344 |
240 |
184 |
2768 |
kernel/rcutree.o |
|
|
|
3167 |
Hierarchical RCU Total |
| |
| Tiny RCU |
294 |
12 |
24 |
330 |
kernel/rcupdate.o |
| 563 |
36 |
0 |
599 |
kernel/rcutiny.o |
|
|
|
929 |
Tiny RCU Total |
Tiny RCU is about twice as small as Classic RCU (almost 900 bytes
saved), and more
than three times smaller than Hierarchical RCU (more than 2KB saved).
As RCU's capabilities have grown, so has its code size.
This Bloatwatch edition of RCU forces this trend in reverse,
producing a very small
implementation via the !CONFIG_SMP restriction.
This implementation also provides a minimal Linux-kernel-capable
implementation that may provide a good starting point for people wishing
to learn about RCU implementations.
Acknowledgements
I am grateful to Josh Triplett for the conversation that got this
project started,
to Ingo Molnar and David Howells for their encouragement to see this through,
to Will Schmidt for his help getting this to human-readable state,
and to Kathy Bennett for her support of this effort.
This work represents the view of the authors and does not necessarily
represent the view of IBM.
Quick Quiz 1:
But what if I need my RCU read-side critical section to sleep and block?
Answer:
A special form of RCU called
"SRCU"
does permit general sleeping in SRCU read-side critical sections.
However, it is usually better to rework your RCU read-side critical
section to avoid sleeping and blocking.
One other exception is
preemptable RCU,
which allows RCU read-side
critical sections to be preempted and to block while acquiring
what in non-RT kernels would be spinlocks.
Back to Quick Quiz 1.
Quick Quiz 2:
But we should be able to simply execute all callbacks at each
quiescent state!
Why to we need to separate ->donetail and
->curtail sublists?
Answer:
Recall that the callbacks are not invoked directly from the scheduler,
but rather from softirq context.
It would be illegal to invoke callbacks that were registered
after the quiescent state but before softirq commenced execution.
Such callbacks could be registered from within irq handlers by
invoking call_rcu(), and these irq handlers could be
invoked between the time that the quiescent state occurred and the
time that the softirq handler started executing.
Back to Quick Quiz 2.
Quick Quiz 3:
If all calls to synchronize_rcu() are automatically
grace periods, why isn't synchronize_rcu() simply
an empty function?
Answer: If synchronize_rcu() were an
empty function, then rcutorture would incorrectly conclude that
RCU was not working, as it would appear to rcutorture that grace
periods were never ending.
Furthermore, synchronize_rcu() must at least prevent
the compiler from reordering code across the call to
synchronize_rcu().
Back to Quick Quiz 3.
Quick Quiz 4:
If all calls to synchronize_rcu() are automatically
grace periods, why doesn't synchronize_rcu() process
any pending RCU callbacks?
Answer:
Because there currently appears to be no need to do so, given that
callbacks will be processed on the next context switch or the next
scheduling-clock interrupt from either user-mode execution or from
the idle loop.
Should experience demonstrate that these are insufficient, then
one approach would be to add a raise_softirq(RCU_SOFTIRQ)
statement to synchronize_rcu().
Back to Quick Quiz 4.
Quick Quiz 5:
Why not simply directly call rcu_process_callbacks()?
Answer:
Deferring execution to the softirq environment disentangles RCU callback
invocation from the scheduler, permitting the callbacks to freely
invoke things like wake_up().
In addition, for rcu_bh, quiescent states might occur with locks
held, and the RCU callbacks might well need to acquire these locks,
potentially resulting in deadlock.
In both cases, deferring to the softirq environment ensures a clean
state for the callback.
Back to Quick Quiz 5.
Quick Quiz 6:
Why is only rcu_qsctr_inc() invoked?
What about rcu_bh_qsctr_inc()?
Answer:
Because rcu_qsctr_inc() includes an implicit
rcu_bh_qsctr_inc(), as can be seen from the code
in the previous section.
Back to Quick Quiz 6.
Comments (4 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Security-related
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>