KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

September 13, 2006

This article was contributed by Valerie Henson

The Problem

Kernel developers have written many wonderful and useful tools for debugging and observing system behavior, such as slab allocation debugging, lock dependency tracking, and scheduler statistics. However, few of these tools can be used in production systems (those are computers used to do actual work as opposed to what I use them for, which is compiling and testing my latest kernel patches) because of the overhead they create, even when disabled. Whenever Dave Jones is trying to track down a memory allocation bug in Rawhide and turns on slab debugging, he's inundated with complaints about sluggish systems until he turns it back off again.

We also lack decent tools to do system-wide analysis - analysis spanning the operating system and all running processes - since most tools are built around either a single process (e.g., strace) or a single kernel subsystem (e.g., SCSI logging). When it comes down to root-causing a performance problem on a production system, our hands are pretty much tied if we can't boot into a kernel compiled with support for debugging and tracing - and often we can't reboot, either due to downtime restrictions or rules about certification of software on production systems.

Today, performance analysis on production Linux systems usually ends up being a jumble of iostat, top, sysrq-t, random /proc entries, and unreliable oprofile results (if we're lucky enough to have oprofile). Recently, one of my friends with extensive Linux experience upgraded his business's production system (a computer used to do actual work) to a more recent Linux kernel and found that performance had suddenly dropped to an unusable level. Once he had figured out that many Apache processes were spending a lot of time in iowait, he had no idea where to go next and had to revert to the old kernel without root-causing the problem. Unfortunately, the problem is only reproducible on a system in production use - and so must be investigated using only tools suitable for a production system. System-wide performance analysis on present-day Linux systems remains a black art.

The Solution

The ideal tracing system would cause zero performance degradation when it is disabled, would be dynamically enabled as needed, could collect data over an entire system, and would be safe to use on a production system. The paper describing DTrace, Dynamic Instrumentation of Production Systems, published at the 2004 USENIX Annual Technical Conference, earns itself a place on the Kernel Hacker's Bookshelf for describing the first system that lives up to this ideal.

DTrace was originally written for Solaris on both SPARC and x86, and has recently been ported to Mac OS X. I used DTrace extensively while I was working on Solaris and got used to being able to answer any question I had about a system with a few minutes of script writing. When I went back to work on Linux and could no longer use DTrace, I felt like I went from wielding a sharp steel katana to fumbling with dull flint tools. The only tool for Linux that comes close is SystemTap, which has improved significantly in the last year, though it still remains out of the mainline kernel.

I'm not the only person who thinks DTrace is ground-breaking. DTrace won the top award in the Wall Street Journal's 2006 Technology Awards. MIT's Technology Review named DTrace's lead engineer, Bryan Cantrill, as one of their 2005 TR35 winners, their list of top innovators under the age of 35. Any company with a half-decent marketing group can generate hype, but DTrace has garnered praise from both industry leaders and the people knuckling down to do the real work.

The Paper

The DTrace paper begins with the motivation for DTrace. For many years, Solaris developers, like Linux developers, focused on writing tools to help them in a kernel development environment. Then they began venturing out into the field to analyze real-world systems - and discovered that much of their toolkit was useless. Besides being impossible to use on production systems, their tools were designed to analyze processes or the kernel in isolation. They began to design a dynamic tracing system intended from its inception for use in production systems. It needed to be completely safe, have zero probe effect, aggregate data over the whole system, lose a minimum of trace data, and allow arbitrary instrumentation of any part of the system.

The architecture they came up with divides up the work of tracing into several modular components. The first is DTrace providers. These are kernel modules that know how to create and enable a particular class of DTrace probes. DTrace providers include things like function boundary tracing and virtual memory info tracing. When enabled, each DTrace probe has one or more series of actions associated with it that are executed by the DTrace framework (another kernel module) each time the probe fires, such as "Record the timestamp" or "Get the user stack of this thread." Actions can have predicates - conditions that must be met for the action to be taken. This is one way to cut down on the amount of data that would otherwise be laboriously copied out of the kernel, only to be thrown away in post-processing. A useful predicate might be "Only if the pid is 7893" or "Only if the first argument is non-zero."
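The relationship between probes, predicates, and actions can be illustrated with a small Python model. This is only a sketch of the idea, not DTrace code: the `Probe` and `Enabling` classes, the event dictionary, and the probe label are all invented for the illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Event = Dict[str, int]  # hypothetical probe context: pid, arg0, timestamp


@dataclass
class Enabling:
    """One request to run `action` whenever `predicate` holds."""
    predicate: Callable[[Event], bool]
    action: Callable[[Event], object]
    records: List[object] = field(default_factory=list)


@dataclass
class Probe:
    name: str
    enablings: List[Enabling] = field(default_factory=list)

    def fire(self, event: Event) -> None:
        # A probe with no enablings does no work when it fires.
        for enabling in self.enablings:
            # The predicate filters at the source, so unwanted events
            # are never recorded in the first place.
            if enabling.predicate(event):
                enabling.records.append(enabling.action(event))


probe = Probe("syscall::read:entry")  # illustrative label only
only_pid = Enabling(
    predicate=lambda ev: ev["pid"] == 7893,   # "Only if the pid is 7893"
    action=lambda ev: ev["timestamp"],        # "Record the timestamp"
)
probe.enablings.append(only_pid)

probe.fire({"pid": 7893, "arg0": 3, "timestamp": 100})
probe.fire({"pid": 42, "arg0": 0, "timestamp": 105})
probe.fire({"pid": 7893, "arg0": 0, "timestamp": 110})
print(only_pid.records)  # [100, 110]
```

The event from pid 42 is rejected before any data is stored, which is the point of a predicate: nothing is copied out only to be discarded later.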

Probes are enabled by DTrace consumers - processes which tell the DTrace framework what probe points and actions they want to use. Probes can have multiple consumers. Each consumer has its own set of per-CPU buffers for transferring trace data out of the kernel, which is done in such a way that data is never corrupted, and the consumer is notified if data is lost. Many tracing systems silently drop data, which can lead to serious errors in analysis when an event is significantly under-sampled.
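The difference between silently dropping data and reporting drops is easy to show in miniature. The following Python sketch assumes a simple fixed-capacity buffer per CPU; the `TraceBuffer` class and its method names are hypothetical and say nothing about how DTrace actually lays out its buffers.

```python
class TraceBuffer:
    """Fixed-capacity buffer that counts the records it could not store."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []
        self.drops = 0

    def record(self, item):
        if len(self.records) >= self.capacity:
            # Never overwrite or truncate stored data; count the loss instead.
            self.drops += 1
            return False
        self.records.append(item)
        return True

    def drain(self):
        """Hand the records and the drop count to the consumer, then reset."""
        out, dropped = self.records, self.drops
        self.records, self.drops = [], 0
        return out, dropped


# One buffer per (pretend) CPU, so CPUs never contend for the same buffer.
buffers = {cpu: TraceBuffer(capacity=2) for cpu in range(2)}
for cpu, item in [(0, "a"), (0, "b"), (0, "c"), (1, "d")]:
    buffers[cpu].record(item)

cpu0_records, cpu0_drops = buffers[0].drain()
cpu1_records, cpu1_drops = buffers[1].drain()
print(cpu0_records, cpu0_drops)  # ['a', 'b'] 1
print(cpu1_records, cpu1_drops)  # ['d'] 0
```

Because the drop count travels with the data, the consumer knows that CPU 0 is under-sampled and can treat its numbers with suspicion.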

The most interesting and controversial part of DTrace is the scripting language, "D", and its conversion to the D Intermediate Format, DIF. Many developers don't understand why C and native machine code aren't preferable - after all, we already know C, and we have plenty of tools for compiling C into runnable machine code. Why reinvent the wheel? The answer comes in two parts.

First, D was invented to quickly form questions about a running system. A quote from the paper: "Our experience showed that D programs were rapidly developed and edited and often written directly on the dtrace(1M) command line." As such, it lends itself to a script-like language that is friendly to rapid prototyping. It is also intended primarily to gather and process data, and as such an awk or Python-like structure was more appropriate. The language used to specify probe actions should be specialized for the task at hand, rather than simply reusing a language designed for generic system programming. At the same time, D is very similar to C (the paper describes D as "a companion language to C") and C programmers can quickly learn D.

Second, some level of emulation is needed for safety. Not all program errors can be caught in an initial pass; things like illegal dereferences must be caught and handled on the fly. The in-kernel DIF emulator is vital for the level of safety needed to use DTrace on a production system. When explaining to Linux developers the need to prevent buggy scripts from crashing the system, often the response is, "Well, don't do that." But imagine for a minute that you are debugging with SystemTap on your friend's production Linux server. When they ask you if it could possibly crash their system (which will cost them many thousands of dollars in lost business), you don't want to say, "Well, only if I have a bug in the scripts I am writing... on the fly... without code review... Um, how many thousands of dollars did you say?" A tracing system that can still cause the system to crash in some situations will be limited to kernel developers, students, and other people with the luxury of unscheduled downtime.
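To see why an emulator helps, consider a toy interpreter that checks every memory access before performing it. The instruction set below is invented for this sketch and is not DIF; the only point is that a bad dereference becomes an error value handed back to the caller instead of a crash of the host.

```python
class Fault(Exception):
    """Raised inside the interpreter; never escapes run()."""


def run(program, memory, max_steps=64):
    """Execute a straight-line program; return (result, error)."""
    regs = [0, 0, 0, 0]
    try:
        # No branch instructions exist, so a bounded program always ends.
        if len(program) > max_steps:
            raise Fault("program too long")
        for op, *args in program:
            if op == "set":
                dst, value = args
                regs[dst] = value
            elif op == "load":
                dst, addr_reg = args
                addr = regs[addr_reg]
                # The check that native code would not perform for us.
                if addr not in memory:
                    raise Fault(f"invalid address {addr:#x}")
                regs[dst] = memory[addr]
            elif op == "add":
                dst, a, b = args
                regs[dst] = regs[a] + regs[b]
            elif op == "ret":
                return regs[args[0]], None
            else:
                raise Fault(f"unknown opcode {op!r}")
        raise Fault("program ended without ret")
    except (Fault, IndexError, ValueError) as err:
        # Malformed or faulting scripts are reported, not fatal.
        return None, str(err)


memory = {0x1000: 41}  # the only valid address in this toy machine
good = [("set", 0, 0x1000), ("load", 1, 0), ("set", 2, 1),
        ("add", 3, 1, 2), ("ret", 3)]
bad = [("set", 0, 0), ("load", 1, 0), ("ret", 1)]  # NULL dereference

print(run(good, memory))  # (42, None)
print(run(bad, memory))   # (None, 'invalid address 0x0')
```

The buggy script costs one failed probe action rather than an unscheduled reboot, which is the property a production system needs.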

Two major components of DTrace remain: aggregations and speculative tracing, two methods of reducing trace data at the source, allowing far greater flexibility of tracing. The traditional method of tracing involves generating vast quantities of data, shoveling it out to user space as fast as possible, and then sifting through the detritus with post-processing scripts. The downsides of this approach are data loss (there is a limit to how quickly data can be copied out of the kernel), limitations on what we can trace (without excessive data loss), and expensive post-processing times. If we instead throw away or coalesce trace data at the source, our tracing is cheaper and more flexible.

One method of data pruning is aggregations, which coalesce a set of data into a useful summary. For example, with only a few lines of D, you can create an aggregation that collects a frequency distribution of the size of mmap function calls across all processes on the system. The alternative is copying out the entire set of trace data for each mmap call on the system, then writing a script to extract the sizes and calculate the distribution - which is slower, more error-prone, and has a much higher probe effect.
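As a rough illustration of what an aggregation saves, the Python sketch below keeps only a power-of-two frequency distribution of request sizes rather than every individual record. The `SizeDistribution` class and the sample sizes are made up for the example; in D the equivalent would be a few lines of script rather than a class.

```python
from collections import Counter


def bucket(value):
    """Largest power of two <= value (0 for non-positive values)."""
    return 1 << (value.bit_length() - 1) if value > 0 else 0


class SizeDistribution:
    """Keeps one counter per bucket instead of one record per event."""

    def __init__(self):
        self.counts = Counter()

    def add(self, value):
        self.counts[bucket(value)] += 1


# Pretend these are the length arguments of mmap calls seen system-wide.
sizes = [4096, 4096, 8192, 5000, 100, 65536]

dist = SizeDistribution()
for size in sizes:
    dist.add(size)

for low, count in sorted(dist.counts.items()):
    print(f"{low:>6} {'@' * count} {count}")
```

However many calls are traced, the state that must leave the kernel is bounded by the number of buckets, not by the number of events.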

Speculative tracing is even more interesting; it allows a script to collect trace data and then decide whether to throw it away or pass it back up to user space. This is vital for collecting data for a common event, of which only a few events are judged "interesting" later on. For example, if you want to trace the entire call path of all system calls that result in a particular error code, you can speculatively trace each system call, but throw away the data for all system calls except the ones with the interesting error code.
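The commit-or-discard pattern can be sketched in a few lines of Python. Everything here is hypothetical: the `Speculations` class, the function names in the call paths, and the choice of error code are stand-ins to show the control flow, not DTrace's interface.

```python
EIO = 5  # the error code we pretend to be hunting for


class Speculations:
    """Holds tentative trace data until it is committed or discarded."""

    def __init__(self):
        self._pending = {}
        self._next_id = 1
        self.committed = []

    def begin(self):
        sid = self._next_id
        self._next_id += 1
        self._pending[sid] = []
        return sid

    def trace(self, sid, record):
        self._pending[sid].append(record)

    def commit(self, sid):
        # Only now does the data become visible to the consumer.
        self.committed.extend(self._pending.pop(sid))

    def discard(self, sid):
        del self._pending[sid]


def trace_syscall(spec, name, call_path, errno):
    """Trace every call speculatively; decide at return time."""
    sid = spec.begin()
    for function in call_path:
        spec.trace(sid, (name, function))
    if errno == EIO:
        spec.commit(sid)
    else:
        spec.discard(sid)


spec = Speculations()
trace_syscall(spec, "read", ["vfs_read", "disk_io"], errno=0)
trace_syscall(spec, "write", ["vfs_write", "disk_io"], errno=EIO)
trace_syscall(spec, "open", ["vfs_open"], errno=2)
print(spec.committed)  # [('write', 'vfs_write'), ('write', 'disk_io')]
```

Three system calls were traced in full, but only the one that failed with the interesting error ever reaches the output.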

If you don't have much time to read the DTrace paper, be sure to at least read Section 9, which describes a session root-causing a mysterious performance problem on a large server with hundreds of users. In the end, 6 instances of a stock ticker applet were putting so much load on the X server that killing them resulted in an increase in system idle time of 15% (!!!). More DTrace examples are available, linked to from the DTrace OpenSolaris web site.

What does this mean for Linux?

Hopefully anyone who saw Dave Jones' Why Userspace Sucks talk at OLS 2006 will already be excited about using SystemTap to track down problems. SystemTap is the current state-of-the-art dynamic tracing system for Linux. It has little or no probe effect - performance degradation when it is disabled - and it can trace events across the system. However, it still has some way to go in the areas of safety, early data processing, and general usability. Understanding the DTrace paper will help people understand why these areas are important. More importantly, understanding the DTrace paper will help people understand how they can use SystemTap to solve interesting problems.

Bored? Lonely? Download SystemTap and start investigating performance problems today! If you're running FC4, you can even install SystemTap using yum.


KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 14, 2006 3:48 UTC (Thu) by davej (subscriber, #354) [Link]

> Whenever Dave Jones is trying to track down a memory allocation bug in
> Rawhide and turns on slab debugging, he's inundated with complaints about
> sluggish systems until he turns it back off again.

Actually SLAB_DEBUG isn't /that/ bad (or at least, people seem used to the hit when running rawhide. [Or maybe everyone has fast enough CPUs to make it not a big deal these days :) ]). The real killer that I get complaints from is CONFIG_DEBUG_PAGEALLOC. That one is only on as a last resort if I'm chasing something elusive that slab debug isn't shaking out.

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 14, 2006 4:38 UTC (Thu) by maneesh_soni (subscriber, #7770) [Link]

> The only tool for Linux that comes close is SystemTap, which has improved significantly in the last year, though it still remains out of the mainline kernel

IMHO, the whole SystemTap tool should not reside in the mainline kernel. kprobes (the underlying kernel component for SystemTap) has been part of mainline for ~2 years. Probably there are some portions (please state if known) which can go in the mainline kernel. Or does this statement just indicate that SystemTap is not yet a mainstream tool for the Linux kernel community?

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 14, 2006 8:44 UTC (Thu) by mingo (subscriber, #31122) [Link]

SystemTap should be in the mainline kernel, because it's just so extremely useful when doing everyday kernel tracing and kernel debugging. The moment it's in the mainline kernel we won't need source-intrusive tracing frameworks anymore. (this means we can remove lots of debug cruft from the existing kernel tree, and we won't need to add new tracing cruft either.)

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 14, 2006 6:03 UTC (Thu) by drosser (guest, #29597) [Link]

SystemTap has been a long time coming, but it IS coming. The problem has been that all the production servers are running RHEL, SLES, or perhaps even Debian Stable, none of which support SystemTap (officially) at the moment.

While we're all looking at SystemTap, I should point out a neat little project has even produced a GUI for SystemTap. http://stapgui.sourceforge.net/

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 14, 2006 13:38 UTC (Thu) by rvfh (subscriber, #31018) [Link]

Just wondering, does Brandz provide some of the functionality, at least for processes?

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 14, 2006 13:53 UTC (Thu) by richardm (subscriber, #7750) [Link]

Bryan Cantrill and his cohorts are certainly to be congratulated in raising awareness for this type of technology. But whether it can be claimed that they have earned a place in the "Kernel Hacker's Bookshelf for describing the first system that lives up to this ideal" is arguable. Preceding dtrace is a lineage of evolving technologies that go back to the dawn of commercial mainframes. Their paper references the immediately preceding work on Dynamic Probes for Linux, which was published in 2001 at FREENIX. That work acknowledges its immediate predecessor, DTRACE for OS/2 (ca 1994) - yes, that's right, the name isn't even original. OS/2's DTRACE provided a comprehensive low-level scripting language that allowed system-wide instrumentation to be applied to a running system. And that language permitted data gathering, rudimentary statistics and triggers for other debugging capabilities. DTRACE arose from an earlier implementation in OS/2 Version 1, the user interface to which was provided by a high-level scripting language and a language interpreter called TRCUST. Besides providing dynamic tracing, TRCUST also provided dynamic profiling. All of that has its origins in similar technologies developed in the 1970s. An example being the Dynamic Support System, which was part of IBM's OS/VS2 (precursor to MVS, great-grandfather of zOS) which ran on the IBM S/370 mainframe. But the story doesn't stop there. DSS came from RSS which was present in some embryonic predecessor to VS2 back in the 1960s.

The problem with debugging technologies is that they have always been, and still are to some extent, considered to be poor relations to "the real operating system features". Debugging has been regarded by many as having no rightful place in a production system. This view as a general statement is nonsense and Sun's DTRACE has greatly helped dispel that nonsense view. One could say that it has done Linux a great favour by bringing the need to a broader public arena. The work done by Sun has added to the debate and the work already done by IBM, SGI, HP, Intel, Red Hat, SuSE and many others in this arena on Linux. The infrastructure to support dynamic instrumentation was accepted into the Linux kernel during the 2.5 development cycle. We now have a comparable tool to DTRACE - namely SystemTap, that exploits this infrastructure.

Richard J Moore - IBM Linux Technology Center

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 15, 2006 12:45 UTC (Fri) by zdzichu (subscriber, #17118) [Link]

Sorry, but SystemTap still isn't as stable as DTrace and can't trace userland. The amount of work which went into KProbes is, indeed, admirable.
But Sun, after a few years of work, has a production-ready, useful product. And Linux people have not. Get that into your head.
Similar thing with ZFS. Solaris has it production ready, five years after first commit. Will we have something comparable in 2011? Doubtful.

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 15, 2006 15:20 UTC (Fri) by anonymous21 (guest, #30106) [Link]

Who cares about ZFS anyway?

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 24, 2006 21:59 UTC (Sun) by jamesd_wi (guest, #40720) [Link]

> Who cares about ZFS anyway?

Anyone who has ever used an LVM solution, anyone that knows in the future they may need to exceed the storage of a 64bit filesystem. Anyone that might want to take instantaneous snapshots of their data.

It has Veritas scared enough that it now gives away its commercial filesystem that used to cost over $2000.

FreeBSD users are looking forward to having ZFS as well. (and are rapidly porting it.)

Even OSX users are thinking about getting ZFS.

Seems the only people that might claim that they don't care about ZFS is a linux zealot that tries to claim they don't want ZFS, but really they would love to have it but realize that their silly GPL license keeps them from benefiting from ZFS's technology.

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 15, 2006 18:17 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

> The problem with debugging technologies is that they have always been, and still are to some extent, considered to be poor relations to "the real operating system features".

I think this stems from another concept: that it's wrong for production systems to have bugs. If you think by the time a system goes into production, it shouldn't have bugs, then diverting investment to production debugging tools feels wrong. It's a lot like distributing clean needles to illegal drug users and condoms in school.

I reached a conclusion a long time ago that this view is wrong. Bugs in the field should be expected and investment should be diverted from preventing bugs to making them easier to work around, diagnose, and repair in the field.

I'm seeing more and more diagnostics enabled by default, even with a runtime cost (memory, CPU time, disk space), indicating people are coming around to this view.

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 17, 2006 15:18 UTC (Sun) by kreutzm (guest, #4700) [Link]

I agree with this view. When you run an evaluation of an IT product, it is quite common to find bugs (even in code shipping for some time already). You can either be proud about the bugs you've found or you can optimize your tools and get those bugs fixed quickly.

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 14, 2006 22:13 UTC (Thu) by Tet (subscriber, #5433) [Link]

SystemTap is a wonderful tool, and has come to my rescue several times. It pointed out, for example, that an idle Sun JVM plugin in my web browser was responsible for over 80% of all the system calls being made on my machine, and was significantly hampering performance.

At the moment, though, it's a pain in the arse to get working on a non-Fedora machine, in part due to the lack of documentation, and in part due to the plain missing source code[1]. I long for the day that I can get it up and running on a CentOS Xen instance on an amd64 machine. It would help me track down some performance issues we have with one of our production machines.

The sooner it's shipped as standard with mainstream distributions, the better, as far as I'm concerned.

[1] Although I think this has now been fixed. Certainly in the past, the necessary version of elfutils was missing from the listed download site, which made compiling SystemTap yourself somewhat tricky to say the least...

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 15, 2006 5:35 UTC (Fri) by dberkholz (subscriber, #23346) [Link]

It's already in at least Gentoo, and FC4, as pointed out elsewhere. That only leaves a few more.

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 14, 2006 23:11 UTC (Thu) by dougm (guest, #4615) [Link]

Val, really interesting article, thanks. My one nit: the use of "root-cause" as a verb, though concise, is IMHO a horrible barbarism. Please be kinder to the English language. :)

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 15, 2006 3:02 UTC (Fri) by bronson (subscriber, #4806) [Link]

I disagree. I think using standard industry jargon keeps articles light and focused. LWN would be a duller place if they had to Strunk & White every submission.

I agree that it was a great article, of course.

standard?

Posted Sep 15, 2006 4:55 UTC (Fri) by roelofs (guest, #2599) [Link]

> I think using standard industry jargon keeps articles light and focused.

Standard jargon, sure. But this?? I've been in "the industry" for 15 or 20 years, depending how you count, and I've never heard this usage before. My brain kept trying to parse it as some sort of security disaster ("Whoa, DTrace causes root exploits?!"). This kind of jargon we don't need.

> LWN would be a duller place if they had to Strunk & White every submission.

See, now that's recognizable jargon--precise, informal, even humorous; it fits right in with LWN's overall style. And until today, all of Val's articles fit in well, too. Many of the contributed articles (and even a few of the regular staff's) don't quite live up to that standard, however. (One of the more egregious examples in recent weeks used commas in place of semicolons and periods--any high-school graduate should be capable of better than that!) Perhaps many of these problems are invisible to those who write the same way, but they're kind of jarring if you've grown accustomed to Jon's outstanding prose.

In short: clean, grammatical writing need not preclude either informality or humor. Spend some quality time with Mark Twain or Winston Churchill or even the Bard... Good examples abound.

Greg

standard?

Posted Sep 15, 2006 18:20 UTC (Fri) by bronson (subscriber, #4806) [Link]

Kernel: O miserable age! I tell thee, thy kernel is sullied. Tainted?? Nay, tis corruption hast brought perversion upon the noble state. The potent poison o'er crows my spirit. Be-netted round with IDE timeouts, thy tender servant halts here.

dies

Admin: Now crackst a noble heart. Goodbye sweet kernel. God knows when we shall meet again.

barbaric grammar

Posted Sep 15, 2006 4:37 UTC (Fri) by roelofs (guest, #2599) [Link]

> My one nit: the use of "root-cause" as a verb, though concise, is IMHO a horrible barbarism. Please be kinder to the English language. :)

Amen, Brother! And thanks for clarifying what the author meant by that--my brain simply locked up on it; I couldn't, for the life of me, figure out how to parse it. (I actually thought it was the result of some ill-considered global search-and-replace operation. :-/ )

Greg

barbaric grammar

Posted Sep 15, 2006 18:06 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

I'm not even sure it's the right word. It looks like one of those naive management neologisms designed to make you think something that's been around forever is new. Lots of verbed words originate that way. From context, I believe the simpler, older term "diagnose" is what the article means.

The reason I suspect management is that the main difference between a root cause and just a cause is that the root cause helps you figure out where you can change a process to stop similar problems from happening in the future. Just for engineering, a more proximate cause is usually sufficient.

Detritus

Posted Sep 17, 2006 5:12 UTC (Sun) by ncm (subscriber, #165) [Link]

While we're at it... the junk data shipped to user space by kernel probes probably should not be called "detritus", but rather "jetsam".

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 15, 2006 5:35 UTC (Fri) by meyert (subscriber, #32097) [Link]

Why exactly is oprofile unreliable?

Jarod Jenson, DTrace "Top Gun"

Posted Sep 15, 2006 16:40 UTC (Fri) by qu1j0t3 (guest, #25786) [Link]

A power user whose working life revolves around DTrace, Jarod recently started blogging war stories and reflections. (Credits: Jim Grisanzio, Chris Ratcliffe.)

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 21, 2006 11:00 UTC (Thu) by pharm (guest, #22305) [Link]

"A tracing system that can still cause the system to crash in some situations will be limited to kernel developers, students, and other people with the luxury of unscheduled downtime."

Is it actually possible for a systemtap script to crash the kernel directly (without any embedded C in the script that is)? (trundles off to read the systemtap wiki).

Hmm: The systemtap documentation does say: "In practice, there are several weak points in systemtap and the underlying kprobes system at the time of writing. Putting probes indiscriminately into unusually sensitive parts of the kernel (low level context switching, interrupt dispatching) has reportedly caused crashes in the past. We are fixing these bugs as they are found, and constructing a probe point blacklist, but it is not complete."

I'll be willing to bet that DTrace had similar issues before public release though...

The 'all-or-nothing' nature of dropping compiled code (script->C) straight into the kernel does mean that the code could do anything at all, which in turn means that system tap has to be root-only: DTrace can have more fine-grained access restrictions due to the in-kernel script interpreter.

If they fix the 'can crash the system by tapping the wrong points' problems, then having the system tap script compiler be setuid and disallow embedded C for user-space scripts should allow the same kind of finer-grained permissions that DTrace allows (at the expense of the stap binary being a potential root-hole of course: the in-kernel interpreter will always be more secure).

KHB: Dynamic Instrumentation of Production Systems (a.k.a. DTrace)

Posted Sep 24, 2006 21:47 UTC (Sun) by jamesd_wi (guest, #40720) [Link]

a real comparison of systemtap vs. dtrace

Systemtap vs DTrace comparison

Of course, the real test of any tool is whether anyone uses it. Follow this link to see examples of many people using DTrace to solve problems, and how it's being integrated with other applications and languages to solve many problems.

What is DTrace

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds