
Fun with tracepoints


By Jonathan Corbet
August 12, 2009
A tracepoint is a marker within the kernel source which, when enabled, can be used to hook into a running kernel at the point where the marker is located. Tracepoints can be used by a number of tools for kernel debugging and performance problem diagnosis. One of the advantages of the DTrace system found in Solaris is the extensive set of well-documented tracepoints in the kernel (and beyond); they allow administrators and developers to monitor many aspects of system behavior without needing to know much about the kernel itself. Linux, instead, is rather late to the tracepoint party; mainline kernels currently feature only a handful of static tracepoints. Whether that number will grow significantly is still a matter of debate within the development community.

LWN last looked at the tracepoint discussion in April. Since then, the disagreement has returned with little change. The catalyst this time was Mel Gorman's page allocator tracepoints patch, which further instruments the memory management layer. The mainline kernel already contains tracepoints for calls to functions like kmalloc(), kmem_cache_alloc(), and kfree(). Mel's patch adds tracepoints to the low-level page allocator, in places like free_pages_bulk(), __rmqueue_fallback(), and __free_pages(). These tracepoints give a view into how the page allocator is performing; they'll inform a suitably clueful user if fragmentation is growing or pages are being moved between processors. Also included is a postprocessing script which uses the tracepoint data to create a list of which processes on the system are putting the most stress on the memory management code.
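The flavor of such a postprocessing script is easy to sketch. The snippet below is not Mel's actual script; it aggregates per-process event counts from a few lines of made-up trace data (the file path and the record format are illustrative assumptions, not real ftrace output):

```shell
# Hypothetical sample of per-process allocation trace records; real
# tracepoint output has a different, richer format.
cat > /tmp/trace-sample.txt <<'EOF'
gcc-1234  mm_page_alloc order=0
gcc-1234  mm_page_alloc order=0
make-1000 mm_page_alloc order=1
sh-1100   mm_page_alloc order=0
gcc-1234  mm_page_alloc order=0
EOF

# Count events per process name (stripping the PID), busiest first --
# the same "who is stressing the allocator" summary the script produces.
awk -F'[- ]' '{count[$1]++} END {for (p in count) print count[p], p}' \
    /tmp/trace-sample.txt | sort -rn
```

The top of the resulting list names the processes putting the most pressure on the allocator.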

As has happened before, Andrew Morton questioned the value of these tracepoints. He tends not to see the need for this sort of instrumentation, seeing it instead as debugging code which is generally useful to a single developer. Beyond that, Andrew asks, why can't the relevant information be added to /proc/vmstat, which is an established interface for the provision of memory management information to user space?

There are a couple of answers to that question. One is that /proc/vmstat has a number of limitations; it cannot be used, for example, to monitor the memory-management footprint of a specific set of processes. It is, in essence, pre-cooked information about memory management in the system as a whole; if a developer needs information which cannot be found there, that information will be almost impossible to get. Tracepoints, instead, provide much more specific information which can be filtered to give more precise views of the system. Mel bashed out one demonstration: a SystemTap script which uses the tracepoints to create a list of which processes are causing the most page allocations.

Ingo Molnar posted a lengthy set of examples of what could be done with tracepoints; some of these were later taken by Mel and incorporated into a document on simple tracepoint use. These examples merit a look; they show just how quickly and how far the instrumentation of the Linux kernel (and associated tools) have developed.

One of the key secrets for quick use of tracepoints is the perf tool which is shipped with the kernel as of 2.6.31-rc1. This tool was written as part of the performance monitoring subsystem; it can be used, for example, to run a program and report on the number of cache misses sustained during its execution. One of the features slipped into the performance counter subsystem was the ability to treat tracepoint events like performance counter events. One must set the CONFIG_EVENT_PROFILE configuration option; after that, perf can work with tracepoint events in exactly the same way it manages counter events.
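Whether a given kernel was built with that option can be checked in its configuration file. A quick sketch, using an embedded sample config as a stand-in (on a real system the file would typically be /boot/config-$(uname -r) or /proc/config.gz):

```shell
# Sample kernel config fragment standing in for the real config file.
cat > /tmp/sample-config <<'EOF'
CONFIG_PERF_COUNTERS=y
CONFIG_EVENT_PROFILE=y
CONFIG_FTRACE=y
EOF

# Tracepoint events are visible to perf only if this option is set.
if grep -q '^CONFIG_EVENT_PROFILE=y' /tmp/sample-config; then
    echo "tracepoint events available to perf"
else
    echo "CONFIG_EVENT_PROFILE not set"
fi
```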

With that in place, and a working perf binary, one can start by seeing which tracepoint events are available on the system:

    $ perf list
      ...
      ext4:ext4_sync_fs                        [Tracepoint event]
      kmem:kmalloc                             [Tracepoint event]
      kmem:kmem_cache_alloc                    [Tracepoint event]
      kmem:kmalloc_node                        [Tracepoint event]
      kmem:kmem_cache_alloc_node               [Tracepoint event]
      kmem:kfree                               [Tracepoint event]
      kmem:kmem_cache_free                     [Tracepoint event]
      ftrace:kmem_free                         [Tracepoint event]
      ...

How many kmalloc() calls are happening on a system? The question can be answered with:

    $ perf stat -a -e kmem:kmalloc sleep 10

     Performance counter stats for 'sleep 10':

           4119  kmem:kmalloc            

     10.001645968  seconds time elapsed

So your editor's mostly idle system was calling kmalloc() over 400 times per second. The -a option gives whole-system results, but perf can also look at specific processes. Monitoring allocations during the building of the perf tool gives:

    $ perf stat -e kmem:kmalloc make
      ...
     Performance counter stats for 'make':

           5554  kmem:kmalloc            

      2.999255416  seconds time elapsed
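The per-second rate quoted for the first example can be extracted mechanically from saved perf stat output. A small sketch, working on a captured copy of the whole-system sample above (the file path is just a scratch location):

```shell
# Saved copy of the whole-system 'perf stat' output shown earlier.
cat > /tmp/perf-stat.txt <<'EOF'
 Performance counter stats for 'sleep 10':

       4119  kmem:kmalloc

 10.001645968  seconds time elapsed
EOF

# Divide the event count by the elapsed time to get the rate.
awk '/kmem:kmalloc/ {count=$1}
     /seconds time elapsed/ {printf "%.0f events/sec\n", count/$1}' \
    /tmp/perf-stat.txt
```

For these numbers it reports 412 events/sec, confirming the "over 400 per second" figure.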

More detail can be had by recording data and analyzing it afterward:

    $ perf record -c 1 -e kmem:kmalloc make
      ...
    $ perf report
    # Samples: 6689
    #
    # Overhead          Command                         Shared Object  Symbol
    # ........  ...............  ....................................  ......
    #
      19.43%             make  /lib64/libc-2.10.1.so                 [.] __getdents64
      12.32%               sh  /lib64/libc-2.10.1.so                 [.] __execve
      10.29%              gcc  /lib64/libc-2.10.1.so                 [.] __execve
       7.53%              cc1  /lib64/libc-2.10.1.so                 [.] __GI___libc_open
       5.02%              cc1  /lib64/libc-2.10.1.so                 [.] __execve
       4.41%               sh  /lib64/libc-2.10.1.so                 [.] __GI___libc_open
       3.45%               sh  /lib64/libc-2.10.1.so                 [.] fork
       3.27%               sh  /lib64/ld-2.10.1.so                   [.] __mmap
       3.11%               as  /lib64/libc-2.10.1.so                 [.] __execve
       2.92%             make  /lib64/libc-2.10.1.so                 [.] __GI___vfork
       2.65%              gcc  /lib64/libc-2.10.1.so                 [.] __GI___vfork

Conclusion: the largest source of kmalloc() calls in a simple compilation process is getdents(), called from make, followed by the execve() calls needed to run the compiler.
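That conclusion can be double-checked by summing the overhead column per command, since one command (cc1 here) may appear under several symbols. A sketch over a few lines transcribed from the report above:

```shell
# A few lines of the perf report output, simplified to
# "overhead command object symbol" columns.
cat > /tmp/perf-report.txt <<'EOF'
19.43% make /lib64/libc-2.10.1.so [.] __getdents64
12.32% sh   /lib64/libc-2.10.1.so [.] __execve
10.29% gcc  /lib64/libc-2.10.1.so [.] __execve
 7.53% cc1  /lib64/libc-2.10.1.so [.] __GI___libc_open
 5.02% cc1  /lib64/libc-2.10.1.so [.] __execve
EOF

# Sum the overhead percentages per command, largest first.
awk '{gsub("%","",$1); sum[$2] += $1}
     END {for (c in sum) printf "%.2f %s\n", sum[c], c}' \
    /tmp/perf-report.txt | sort -rn
```

Even with only these lines, make remains on top; aggregating per command is how one would confirm that ranking over the full report.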

The perf tool can take things further; it can, for example, generate call graphs and disassemble the code around specific performance-relevant points. See Ingo's mail and Mel's document for more information. Even then, we're just talking about statistics on tracepoints; there is a lot more information available which can be used in postprocessing scripts or tools like SystemTap. Suffice to say that tracepoints open a lot of possibilities.

The obvious question is: was Andrew impressed by all this? Here's his answer:

So? The fact that certain things can be done doesn't mean that there's a demand for them, nor that anyone will _use_ this stuff.

As usual, we're adding tracepoints because we feel we must add tracepoints, not because anyone has a need for the data which they gather.

He suggested that he would be happier if the new tracepoints could be used to phase out /proc/vmstat and /proc/meminfo; that way there would not be a steadily-increasing variety of memory management instrumentation methods. Removing those files is problematic for a couple of reasons, though. One is that they form part of the kernel ABI, which is not easily broken. It would be a multi-year process to move applications over to a different interface and be sure there were no more users of the /proc files. Beyond that, though, tracepoints are good for reporting events, but they are a bit less well-suited to reporting the current state of affairs. One can use a tracepoint to see page allocation events, but an interface like /proc/vmstat can be more straightforward if one simply wishes to know how many pages are free. There is space, in other words, for both styles of instrumentation.
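The difference between the two styles can be illustrated with /proc/vmstat itself: that file reports cumulative counters, so event activity over an interval has to be derived by differencing two snapshots, where a tracepoint would simply have fired once per event. A sketch using two canned snapshots as stand-ins for reads of the real file (the counter values are made up):

```shell
# Two snapshots of /proc/vmstat-style counters, taken "before" and
# "after" an interval of interest; the values are sample data.
cat > /tmp/vmstat.before <<'EOF'
nr_free_pages 81245
pgalloc_normal 1500000
EOF
cat > /tmp/vmstat.after <<'EOF'
nr_free_pages 80930
pgalloc_normal 1504119
EOF

# Events during the interval = difference of the cumulative counters.
join /tmp/vmstat.before /tmp/vmstat.after | \
    awk '$1 == "pgalloc_normal" {print $3 - $2, "page allocations"}'
```

Note that the snapshot interface answers "how many pages are free right now?" for free, while per-event questions require this kind of differencing and lose all per-process detail.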

As of this writing, nobody has made a final pronouncement on whether the new tracepoints will be merged. Andrew has made it clear, though, that, despite his concerns, he's not firmly opposing them. There is enough pressure to get better instrumentation into the kernel, and enough useful things to do with that instrumentation, that, one assumes, more of it will go into the mainline over time.



Fun with tracepoints

Posted Aug 13, 2009 2:45 UTC (Thu) by karim (subscriber, #114) [Link]

Maybe I'm just missing something but ... I can't help but feel somewhat cynical about this. I released the Linux Trace Toolkit on July 22nd, 1999, over 10 years ago - and, FWIW, quite some time before DTrace came to be. The value of static tracepoints seemed obvious at the time (to me at least). I can't believe this debate is still going on. In fact, I can't help but think that Linux in this case could have been far ahead of DTrace. Of course the worst part is that early on all this static tracing was turned down because it would result in unmaintainable bloat. The irony is that the vast majority of the initially suggested tracepoints are still valid today.

As to the argument that nobody wants to "use" this stuff, I've never bought this. You can't expect users to come asking for tools they've never seen before -- that's rare. That doesn't mean they won't find those tools very useful if they were made available to them. It just so happens that those feeling the need for these tools have no way to show mass user adoption of these tools because they can never get those tools to the users in the 1st place (if it's not mainlined, it's likely not in your latest distro ...) So one can only point at other OSes delivering the same functionality ... The fact of the matter is that in this specific case users don't get enough credit for acting smart when given the right information. Even Windows allows me to get more information about what's going on than Linux does. There has to be a point where users are given the tools to find out what *they* want to know about what's going on, not what some maintainer somewhere decides they should see through /proc/foo.

I seriously hope this issue can be settled to Linux's benefit at some point in the future. Though I stopped maintaining LTT quite a few years ago, I still hope one day to have tools in Ubuntu that give me full control over the information *I* want to see.

Karim Yaghmour

Fun with tracepoints

Posted Aug 13, 2009 7:52 UTC (Thu) by michaeljt (subscriber, #39183) [Link]

One of the arguments I've heard against adding static trace points is that they become part of the kernel ABI and can no longer be removed if they prove not to be useful. Perhaps a way around this argument would be to initially add tracepoints with a mangled name that depends on the current kernel release? That way, until a given set of tracepoints have proved themselves, any scripts using them would have to be updated with every kernel release in order to keep working. The mangling needn't be very complex (e.g. all tracepoints in 2.6.32 which might be removed at a later time could have "dot32" pre-pended to their names), it just needs to be unpredictable in future kernel releases.

Fun with tracepoints

Posted Aug 13, 2009 9:59 UTC (Thu) by addw (guest, #1771) [Link]

I don't see a problem with making it clear that static trace points are NOT part of the ABI, i.e. that they may come & go. If you are getting that close to the kernel you have got to expect things to change. But how do people use traces? Probably to look at particular problems, not as general monitoring.

In practice most people use Distro X, version Y. The release people for this will ensure that trace points don't get removed during the 5/... year lifetime of Y, thus you can install new distro provided kernels without worrying. When you rebuild your machine in 5 years time you redo your traces.

LTTng

Posted Aug 13, 2009 8:33 UTC (Thu) by alex (subscriber, #1355) [Link]

FWIW I've used your LTT code on embedded systems and found it very useful in understanding the pseudo-realtime behaviour of my system. The optimist in me likes to think that when we finally have a full tracing solution in the kernel, it will be a much more powerful and refined experience thanks to the 10 years of experimenting done leading up to the final solution.

Fun with tracepoints

Posted Aug 13, 2009 12:20 UTC (Thu) by fuhchee (subscriber, #40059) [Link]

As to the argument that nobody wants to "use" this stuff, I've never bought this. You can't expect users to come asking for tools they've never seen before -- that's rare. That doesn't mean they won't find those tools very useful if they were made available to them.

I think it goes even beyond that. The very fact that certain subsystem maintainers have already found certain tracepoint suites useful to themselves does not seem to carry any weight. A satisfactory burden/level of proof is not stated, just a counterfactual caricature: "I don't think anyone needs this".

Fun with tracepoints

Posted Aug 13, 2009 16:24 UTC (Thu) by SEJeff (subscriber, #51588) [Link]

Why don't you (as a systemtap developer) get a list of random joes, developers, and actual subsystem maintainers to write a small blurb mentioning how static tracepoints helped them out?

Then you can say: look, these are not random uses. Remember all of the pushback against the in-kernel memleak detector stuff? It seems to have pulled its weight already by helping find plenty of bugs.

Fun with tracepoints

Posted Aug 13, 2009 18:18 UTC (Thu) by karim (subscriber, #114) [Link]

Sorry, this has been tried and has failed. Check out the list of companies who have contributed to LTTng:
Google, IBM, Ericsson, Autodesk, Wind River, Fujitsu, Monta Vista, STMicroelectronics, C2 Microsystems, Sony, Siemens, Nokia.

But, hey, who are they to know, the kernel developers know better. And the hell with Sun, Apple, IBM, Microsoft, etc. who spent large gobs of money on implementing tracing infrastructure in their OSes (Apple by the way ported DTrace to MacOS ... :/ ) and maintaining it through the years. They're wrong too. The Linux kernel developers surely are better than the collective intelligence of the engineers and product managers of the aforementioned.

I forgot to mention that apart from pushing and maintaining LTT for a number of years, I also worked on/defended a number of ideas which were dear to my heart. Take for example real-time. Very early on I came to the LKML pointing out that the tacit laissez-faire towards the RTLinux patent was not good for Linux. This was dismissed off-hand: the uses, I was told, were so narrow and the applications so specific that this is a non-issue ("real time apps are a niche market and they're not mainstream" ... i.e. those users don't matter). Skip a few years and there were two approaches being discussed: Ingo's and the iPipe (my idea); at the subsequent OLS I asked a prominent developer what he thought the chances of success of Ingo's very invasive approach were, and his reply was clear: Ingo has got the clout to make it happen. Just about then I knew the iPipe wasn't likely to "win". And have his lunch he did.

That along with other things I witnessed (such as Con Kolivas quitting kernel development because he saw little interest in helping desktop interactivity) made me increasingly feel there's a NIH syndrome. If nothing else, it distills from this that Linux' development has become highly politicized. You're either part of the in-crowd or you're not. And if you're not part of the in-crowd you're going to have a hell of a time trying to push something in if it's the least bit unconventional. Don't get me wrong, being part of the in-crowd doesn't guarantee a radical change's acceptance. But being an outsider clearly ensures that you've got zero chances of success. It might have changed since I stopped keeping track of it all, but juxtapose the previous with the fact that most kernel developers work for/on big iron and you've got a huge disconnect with the realities of real-life mainstream users. It's not that user preoccupations aren't eventually taken care of or fixed (ex.: udev/sysfs/devfs), it's just that they're an absolute non-priority. And *that* is a serious issue. Last I checked, Linux has been flatlining in the end-user market for a very long time. If the diagnosis I'm making out of the symptoms I've witnessed is the least bit right (and I really hope I'm wrong), this isn't about to change any time soon.

I sincerely apologize if I've offended anyone with the above, but this is a case where *everything* ***EVERYTHING*** has been tried to convince the kernel development community. The ball is in their camp.

Fun with tracepoints

Posted Aug 18, 2009 18:33 UTC (Tue) by karim (subscriber, #114) [Link]

Just so there's no misunderstanding, please note that I don't speak on behalf of LTTng in any way, shape, or form; the above opinions are mine and mine alone. LTTng and the now-defunct LTT, which I used to maintain, have nothing but part of the name in common. I could have used any of the other tracing projects as an example, it just so happens that this is the one I'm most familiar with :)

Karim

Fun with tracepoints

Posted Aug 18, 2009 21:40 UTC (Tue) by oak (guest, #2786) [Link]

> If nothing else, it distills from this that Linux' development has
> become highly politicized. You're either part of the in-crowd or
> you're not.

I think the problem is more that individual kernel developers don't really
(need to) look at the whole system or be responsible for it, just a one
corner of it. On commercial operating systems, there are dedicated people
who look after the whole thing and need to make sure that the whole thing
works fine (and this responsibility gives them influence over the
operating system implementation to make sure that these tools get done &
available).

If you're looking just at one or some parts of the whole system, things
like LTT (or to some extent Systemtap[1]) that try to get an overview of
what happens in the whole system may seem too large / complex /
intrusive / bloated. "I just need this specific info from the block
layer" (or memory subsystem, or ...). And then they write their own NIH
tracing for that single thing that doesn't much benefit others, or
somebody who wants to make sense out of the whole system.

Note: I have gotten useful info both from LTT and LTTng (lttng.org) + it's
finally getting easier to apply to kernel... LTTv plays a large part in
this too as one can easily zoom into details etc.

[1] Systemtap seems nice, but it doesn't have the post-processing /
visualization for the whole system like LTT does. I see it more as a
tool for building more specific analysis tools. However, for that kind
of stuff it's a bit too complicated (e.g. in embedded environments
where you don't want to run stap / compile the scripts on the device
itself), so no wonder devs write their own tracing...

Fun with tracepoints

Posted Aug 19, 2009 16:18 UTC (Wed) by fuhchee (subscriber, #40059) [Link]

[1] Systemtap seems nice, but it doesn't have the post-processing / visualization for the whole system like LTT does. I see it more as a tool for building more specific analysis tools.

I see what you mean. systemtap people are working on some GUI data graphing tools, but are just starting. (I got the impression though that LTTV was being deprecated in favour of eclipse-based widgets, which systemtap and other tools could feed data into also.)

However, for this kind of stuff it's a bit too complicated (e.g. in embedded environments where you don't want to run stap / compile the scripts on the device itself etc)

We hope to ease that pain by more automated cross-compilation/execution.

Fun with tracepoints

Posted Aug 20, 2009 19:29 UTC (Thu) by oak (guest, #2786) [Link]

> I got the impression though that LTTV was being deprecated in favour of
eclipse-based widgets

Do you have any pointers to more information about this?

Fun with tracepoints

Posted Aug 20, 2009 19:32 UTC (Thu) by fuhchee (subscriber, #40059) [Link]

I don't want to misrepresent LTTng, so please do take all this
with a grain of salt, but this is what I gathered from the presentations
given at http://ltt.polymtl.ca/tracingwiki/index.php/TracingMiniSu...

Fun with tracepoints

Posted Aug 20, 2009 19:34 UTC (Thu) by fuhchee (subscriber, #40059) [Link]

Fun with tracepoints

Posted Aug 20, 2009 20:58 UTC (Thu) by oak (guest, #2786) [Link]

Thanks! According to this:
http://eclipse.org/linuxtools/projectPages/lttng/

"The first release, scheduled for September 2009 (code name: Vanilla),
will provide feature parity with the LTTng Viewer (LTTV) v0.12.12."

And this seemed to have a screenshot of the LTTng plugin:
http://ltt.polymtl.ca/tracingwiki/images/0/00/TMF_-_Traci...

This was pretty good overview of past & present tracing:
http://ltt.polymtl.ca/tracingwiki/images/5/57/Ts2009-hell...

Architecture bit here was annoying:
http://ltt.polymtl.ca/tracingwiki/images/4/46/Ts2009-Syst...

It assumes that one has a working user-space in the problematic cases
one wants to analyze. The kernel should optionally be able to get the
data out through some high-speed HW interface without going through
user-space, and filling of the flight-record buffer had better not
rely on user-space.

Fun with tracepoints

Posted Aug 22, 2009 16:23 UTC (Sat) by compudj (subscriber, #43335) [Link]

About your comment on the architecture, I just want to clarify a few points. First, the architecture diagram you see at http://ltt.polymtl.ca/tracingwiki/images/4/46/Ts2009-Syst... focuses on tracing of userland. In this diagram, kernel tracing is contained within the "kernel trace facilities" box. For user-space tracing, where the goal is to get data out of the applications, it makes sense to assume that user-space is working. As you point out, this assumption makes less sense when we talk about tracing the kernel.

Second, more specifically about the LTTng kernel tracer, you are right in that the current mechanism used to extract data is a splice() system call controlled by a user-space daemon. However, alternate implementations of ltt-relay-alloc.c and ltt-relay-lockless.c could easily permit the use of a high-speed debug interface. This has already been done with earlier LTTng versions for ARM.

The core of the LTTng kernel tracer therefore does not depend on userland. It's only the peripheral data extraction and trace control modules which depend on working userland. But they could be replaced easily by built-in kernel objects interacting directly with the LTTng kernel API. I made sure all operations we allow from interfaces presented to user-space are also doable from within the kernel.

Mathieu


Copyright © 2009, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds