September 13, 2006
This article was contributed by Valerie Henson
The Problem
Kernel developers have written many wonderful and useful tools for
debugging and observing system behavior, such as slab allocation
debugging, lock dependency
tracking, and scheduler statistics. However, few of these tools
can be used in production systems (those are computers used to do
actual work as opposed to what I use them for, which is
compiling and testing my latest kernel patches) because of the
overhead they create, even when disabled. Whenever Dave Jones is
trying to track down a memory allocation bug in Rawhide and turns on
slab debugging, he's inundated with complaints about sluggish systems
until he turns it back off again.
We also lack decent tools to do system-wide analysis - analysis
spanning the operating system and all running processes - since most
tools are built around either a single process (e.g., strace) or a
single kernel subsystem (e.g., SCSI logging). When it comes down to
root-causing a performance problem on a production system, our hands
are pretty much tied if we can't boot into a kernel compiled with
support for debugging and tracing - and often we can't reboot, either
due to downtime restrictions or rules about certification of software
on production systems.
Today, performance analysis on production Linux systems usually ends
up being a jumble of iostat, top, sysrq-t, random /proc entries, and
unreliable oprofile results (if we're lucky enough to have oprofile).
Recently, one of my friends with extensive Linux experience upgraded
his business's production system (a computer used to do actual
work) to a more recent Linux kernel and found that performance
had suddenly dropped to an unusable level. Once he had figured out
that many Apache processes were spending a lot of time in iowait, he
had no idea where to go next and had to revert to the old kernel
without root-causing the problem. Unfortunately, the problem is only
reproducible on a system in production use - and so must be
investigated using only tools suitable for a production system.
System-wide performance analysis on present-day Linux systems remains
a black art.
The Solution
The ideal tracing system would cause zero performance degradation when
it is disabled, would be dynamically enabled as needed, could collect
data over an entire system, and would be safe to use on a production
system. The paper describing DTrace,
Dynamic Instrumentation of Production Systems, published in
the USENIX 2004 Annual
Technical Conference, earns itself a place on the Kernel Hacker's
Bookshelf for describing the first system that lives up to this ideal.
DTrace was originally written for Solaris on both SPARC and x86, and
has recently been ported to Mac OS
X. I used DTrace extensively while I was working on Solaris and
got used to being able to answer any question I had about a system
with a few minutes of script writing. When I went back to work on
Linux and could no longer use DTrace, I felt like I went from wielding
a sharp steel katana to fumbling with dull flint tools. The only tool
for Linux that comes close is SystemTap, which has
improved significantly in the last year, though it still remains out
of the mainline kernel.
I'm not the only person who thinks DTrace is ground-breaking. DTrace
won the top award in the
Wall Street Journal's 2006 Technology Awards. MIT's Technology
Review named DTrace's lead engineer, Bryan Cantrill, as one of their 2005 TR35
winners, their list of top innovators under the age of 35. Any
company with a half-decent marketing group can generate hype, but
DTrace has garnered praise from both industry leaders and the
people knuckling down to do the real work.
The Paper
The
DTrace
paper begins with the motivation for DTrace. For many years,
Solaris developers, like Linux developers, focused on writing tools to
help them in a kernel development environment. Then they began
venturing out into the field to analyze real-world systems - and
discovered that much of their toolkit was useless. Besides being
impossible to use on production systems, their tools were designed to
analyze processes or the kernel in isolation. They began to design a
dynamic tracing system intended from its inception for use in
production systems. It needed to be completely safe, have zero probe
effect, aggregate data over the whole system, lose a minimum of trace
data, and allow arbitrary instrumentation of any part of the system.
The architecture they came up with divides up the work of tracing into
several modular components. The first is DTrace providers. These are
kernel modules that know how to create and enable a particular class
of DTrace probes. DTrace providers include things like function
boundary tracing and virtual memory info tracing. When enabled, each
DTrace probe has one or more series of actions associated with it that
are executed by the DTrace framework (another kernel module) each time
the probe fires, such as "Record the timestamp" or "Get the user stack
of this thread." Actions can have predicates - conditions that must
be met for the action to be taken. This is one way to cut down on
the amount of data that would otherwise be laboriously copied out of
the kernel, only to be thrown away in post-processing. A useful
predicate might be "Only if the pid is 7893" or "Only if the first
argument is non-zero."
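As a rough sketch (my own, not an example taken from the paper), a D
clause that puts these pieces together - a probe description, a
predicate, and a couple of actions - looks like this:

    /*
     * Probe: entry to the read() system call.
     * Predicate: fire only for the process with pid 7893.
     * Actions: record the timestamp and the user stack.
     */
    syscall::read:entry
    /pid == 7893/
    {
        trace(timestamp);
        ustack();
    }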
Probes are enabled by DTrace consumers - processes which tell the
DTrace framework what probe points and actions they want to use.
Probes can have multiple consumers. Each consumer has its own set of
per-CPU buffers for transferring trace data out of the kernel, which
is done in such a way that data is never corrupted, and the consumer
is notified if data is lost. Many tracing systems silently drop data,
which can lead to serious errors in analysis when an event is
significantly under-sampled.
The most interesting and controversial part of DTrace is the scripting
language, "D", and its conversion to the D Intermediate Format, DIF.
Many developers don't understand why C and native machine code aren't
preferable - after all, we already know C, and we have plenty of tools
for compiling C into runnable machine code. Why reinvent the wheel?
The answer comes in two parts.
First, D was invented to quickly form questions about a running
system. A quote from the paper: "Our experience showed that D
programs were rapidly developed and edited and often written directly
on the dtrace(1M) command line." The task lends itself to a
script-like language that is friendly to rapid prototyping. Since D is
intended primarily to gather and process data, an awk- or
Python-like structure was more appropriate. The language used to
specify probe actions should be specialized for the task at hand,
rather than simply reusing a language designed for generic system
programming. At the same time, D is very similar to C (the paper
describes D as "a companion language to C") and C programmers can
quickly learn D.
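To get a feel for how quickly questions can be asked, here is one of
the classic one-liners that circulate in the DTrace community, typed
directly on the dtrace(1M) command line; it prints the arguments of
every process successfully exec'd anywhere on the system:

    # dtrace -n 'proc:::exec-success { trace(curpsinfo->pr_psargs); }'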
Second, some level of emulation is needed for safety. Not all program
errors can be caught in an initial pass; things like illegal
dereferences must be caught and handled on the fly. The in-kernel DIF
emulator is vital for the level of safety needed to use DTrace on a
production system. When explaining to Linux developers the need to
prevent buggy scripts from crashing the system, often the response is,
"Well, don't do that." But imagine for a minute that you are
debugging with SystemTap on your friend's production Linux server.
When they ask you if it could possibly crash their system (which will
cost them many thousands of dollars in lost business), you don't want
to say, "Well, only if I have a bug in the scripts I am writing... on
the fly... without code review... Um, how many thousands of dollars
did you say?" A tracing system that can still cause the system to
crash in some situations will be limited to kernel developers,
students, and other people with the luxury of unscheduled downtime.
Two major components of DTrace remain: aggregations and speculative
tracing, both methods of reducing trace data at the source that allow
far greater flexibility. The traditional method of tracing
involves generating vast quantities of data, shoveling it out to user
space as fast as possible, and then sifting through the detritus with
post-processing scripts. The downsides of this approach are data loss
(there is a limit to how quickly data can be copied out of the
kernel), limitations on what we can trace (without excessive data
loss), and expensive post-processing times. If we instead throw away
or coalesce trace data at the source, our tracing is cheaper and more
flexible.
One method of data pruning is aggregations, which coalesce a set of
data into a useful summary. For example, with only a few lines of D,
you can create an aggregation that collects a frequency distribution
of the size of mmap function calls across all processes on the system.
The alternative is copying out the entire set of trace data for each
mmap call on the system, then writing a script to extract the sizes
and calculate the distribution - which is slower, more error-prone,
and has a much higher probe effect.
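A sketch of what that might look like in D (my own example, assuming
the mmap length is the second system call argument, arg1):

    /*
     * Build a power-of-two frequency distribution of mmap() sizes,
     * keyed by program name; dtrace prints the aggregation on exit.
     */
    syscall::mmap:entry
    {
        @sizes[execname] = quantize(arg1);
    }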
Speculative tracing is even more interesting; it allows a script to
collect trace data and then decide whether to throw it away or pass it
back up to user space. This is vital when collecting data for a common
event of which only a few occurrences are later judged "interesting."
For example, if you want to trace the entire call path of all system
calls that result in a particular error code, you can speculatively
trace each system call, but throw away the data for all system calls
except the ones with the interesting error code.
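A rough sketch of the idea in D, tracing opens that fail with EACCES
(the probe and error code are just illustrative choices; the
speculation(), speculate(), commit(), and discard() calls are the
DTrace speculative tracing interfaces):

    syscall::open:entry
    {
        /* Start a speculation and record the filename into it. */
        self->spec = speculation();
        speculate(self->spec);
        printf("%s opened %s\n", execname, copyinstr(arg0));
    }

    syscall::open:return
    /self->spec && errno == EACCES/
    {
        /* Interesting: copy the speculative data to the principal buffer. */
        commit(self->spec);
        self->spec = 0;
    }

    syscall::open:return
    /self->spec/
    {
        /* Uninteresting: throw the speculative data away. */
        discard(self->spec);
        self->spec = 0;
    }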
If you don't have much time to read the DTrace paper, be sure to at
least read Section 9, which describes a session root-causing a
mysterious performance problem on a large server with hundreds of
users. In the end, 6 instances of a stock ticker applet were putting
so much load on the X server that killing them resulted in an increase
in system idle time of 15% (!!!). More DTrace
examples are available, linked to from the DTrace
OpenSolaris web site.
What does this mean for Linux?
Hopefully anyone who saw Dave Jones'
Why
Userspace Sucks talk at
OLS 2006
will already be excited about using
SystemTap to track down
problems. SystemTap is the current state-of-the-art dynamic tracing
system for Linux. It has little or no probe effect - performance
degradation when it is disabled - and it can trace events across the
system.
However, it still has some way to go in the areas of safety,
early data processing, and general usability.
Understanding the
DTrace paper will help people understand why these areas are
important. More importantly, understanding the DTrace paper will help
people understand how they can use SystemTap to solve interesting
problems.
Bored? Lonely? Download SystemTap and start investigating
performance problems today! If you're running FC4, you can even install
SystemTap using yum.