Fun with tracepoints
LWN last looked at the tracepoint discussion in April. Since then, the disagreement has returned with little change. The catalyst this time was Mel Gorman's page allocator tracepoints patch, which further instruments the memory management layer. The mainline kernel already contains tracepoints for calls to functions like kmalloc(), kmem_cache_alloc(), and kfree(). Mel's patch adds tracepoints to the low-level page allocator, in places like free_pages_bulk(), __rmqueue_fallback(), and __free_pages(). These tracepoints give a view into how the page allocator is performing; they'll inform a suitably clueful user if fragmentation is growing or pages are being moved between processors. Also included is a postprocessing script which uses the tracepoint data to create a list of which processes on the system are putting the most stress on the memory management code.
As has happened before, Andrew Morton questioned the value of these tracepoints. He tends not to see the need for this sort of instrumentation, seeing it instead as debugging code which is generally useful to a single developer. Beyond that, Andrew asks, why can't the relevant information be added to /proc/vmstat, which is an established interface for the provision of memory management information to user space?
There are a couple of answers to that question. One is that /proc/vmstat has a number of limitations; it cannot be used, for example, to monitor the memory-management footprint of a specific set of processes. It is, in essence, pre-cooked information about memory management in the system as a whole; if a developer needs information which cannot be found there, that information will be almost impossible to get. Tracepoints, instead, provide much more specific information which can be filtered to give more precise views of the system. Mel bashed out one demonstration: a SystemTap script which uses the tracepoints to create a list of which processes are causing the most page allocations.
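The kind of per-process filtering described here can be sketched in a few lines of Python. This is not Mel's SystemTap script; it is a minimal illustration which assumes the trace data has already been decoded into (command, pid, event) records. The record layout, the sample values, and the top_allocators() helper are all hypothetical.

```python
from collections import Counter

# Hypothetical, already-decoded trace records: (command, pid, event name).
# Real tracepoint output carries more fields and needs a tool to collect it.
SAMPLE_EVENTS = [
    ("make", 4101, "kmem:kmalloc"),
    ("cc1", 4107, "kmem:kmalloc"),
    ("cc1", 4107, "kmem:kfree"),
    ("cc1", 4107, "kmem:kmalloc"),
    ("sh", 4105, "kmem:kmalloc"),
    ("make", 4101, "kmem:kfree"),
]


def top_allocators(events, event_name, watched_pids=None):
    """Count event_name per (command, pid), optionally limited to watched_pids."""
    counts = Counter()
    for comm, pid, name in events:
        if name != event_name:
            continue
        if watched_pids is not None and pid not in watched_pids:
            continue
        counts[(comm, pid)] += 1
    # most_common() sorts by descending count
    return counts.most_common()


# Whole-system view, then a view restricted to two processes of interest.
print(top_allocators(SAMPLE_EVENTS, "kmem:kmalloc"))
print(top_allocators(SAMPLE_EVENTS, "kmem:kmalloc", watched_pids={4101, 4105}))
```

The second call is the part that a system-wide counter file cannot offer: the same event stream, narrowed to a chosen set of processes after the fact.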
Ingo Molnar posted a lengthy set of examples of what could be done with tracepoints; some of these were later taken by Mel and incorporated into a document on simple tracepoint use. These examples merit a look; they show just how quickly and how far the instrumentation of the Linux kernel (and associated tools) has developed.
One of the key secrets for quick use of tracepoints is the perf tool which is shipped with the kernel as of 2.6.31-rc1. This tool was written as part of the performance monitoring subsystem; it can be used, for example, to run a program and report on the number of cache misses sustained during its execution. One of the features slipped into the performance counter subsystem was the ability to treat tracepoint events like performance counter events. One must set the CONFIG_EVENT_PROFILE configuration option; after that, perf can work with tracepoint events in exactly the same way it manages counter events.
With that in place, and a working perf binary, one can start by seeing which tracepoint events are available on the system:
$ perf list
...
ext4:ext4_sync_fs [Tracepoint event]
kmem:kmalloc [Tracepoint event]
kmem:kmem_cache_alloc [Tracepoint event]
kmem:kmalloc_node [Tracepoint event]
kmem:kmem_cache_alloc_node [Tracepoint event]
kmem:kfree [Tracepoint event]
kmem:kmem_cache_free [Tracepoint event]
ftrace:kmem_free [Tracepoint event]
...
How many kmalloc() calls are happening on a system? The question can be answered with:
$ perf stat -a -e kmem:kmalloc sleep 10
Performance counter stats for 'sleep 10':
4119 kmem:kmalloc
10.001645968 seconds time elapsed
So your editor's mostly idle system was calling kmalloc() roughly 412 times per second. The -a option gives whole-system results, but perf can also look at specific processes. Monitoring allocations during the building of the perf tool gives:
$ perf stat -e kmem:kmalloc make
...
Performance counter stats for 'make':
5554 kmem:kmalloc
2.999255416 seconds time elapsed
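For those who would rather not do the division by hand, the rates behind these two runs can be worked out with a small Python sketch. It assumes the simplified two-line layout quoted here (an event count, then the elapsed time); real perf stat output has more decoration, and the events_per_second() helper is purely illustrative.

```python
import re

# The two perf stat results quoted above, reduced to the lines that matter.
IDLE_SYSTEM = """\
4119 kmem:kmalloc
10.001645968 seconds time elapsed
"""
MAKE_RUN = """\
5554 kmem:kmalloc
2.999255416 seconds time elapsed
"""


def events_per_second(stat_output, event):
    """Return count / elapsed time for one event in simplified perf stat text."""
    count = re.search(
        r"^\s*(\d+)\s+" + re.escape(event) + r"\s*$", stat_output, re.M
    )
    elapsed = re.search(r"^\s*([\d.]+)\s+seconds time elapsed", stat_output, re.M)
    if count is None or elapsed is None:
        raise ValueError("could not find event count or elapsed time")
    return int(count.group(1)) / float(elapsed.group(1))


print(f"idle system: {events_per_second(IDLE_SYSTEM, 'kmem:kmalloc'):.1f}/s")
print(f"make run:    {events_per_second(MAKE_RUN, 'kmem:kmalloc'):.1f}/s")
```

The idle system comes out at a little under 412 calls per second, while the three-second build runs at roughly 1,850 per second, about four and a half times the background rate.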
More detail can be had by recording data and analyzing it afterward:
$ perf record -c 1 -e kmem:kmalloc make
...
$ perf report
# Samples: 6689
#
# Overhead Command Shared Object Symbol
# ........ ............... .................................... ......
#
19.43% make /lib64/libc-2.10.1.so [.] __getdents64
12.32% sh /lib64/libc-2.10.1.so [.] __execve
10.29% gcc /lib64/libc-2.10.1.so [.] __execve
7.53% cc1 /lib64/libc-2.10.1.so [.] __GI___libc_open
5.02% cc1 /lib64/libc-2.10.1.so [.] __execve
4.41% sh /lib64/libc-2.10.1.so [.] __GI___libc_open
3.45% sh /lib64/libc-2.10.1.so [.] fork
3.27% sh /lib64/ld-2.10.1.so [.] __mmap
3.11% as /lib64/libc-2.10.1.so [.] __execve
2.92% make /lib64/libc-2.10.1.so [.] __GI___vfork
2.65% gcc /lib64/libc-2.10.1.so [.] __GI___vfork
Conclusion: the largest source of kmalloc() calls in a simple compilation process is getdents(), called from make, followed by the execve() calls needed to run the compiler.
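The same report can be sliced in other ways. As a rough sketch of the sort of postprocessing that is possible, the following Python groups the rows shown above by symbol rather than by command and symbol together. The rows are copied from the listing; the overhead_by() helper is an illustration, not part of perf, and it only sees the lines quoted here rather than the full report.

```python
from collections import defaultdict

# Rows copied from the perf report output above: overhead %, command, symbol.
REPORT_ROWS = [
    (19.43, "make", "__getdents64"),
    (12.32, "sh", "__execve"),
    (10.29, "gcc", "__execve"),
    (7.53, "cc1", "__GI___libc_open"),
    (5.02, "cc1", "__execve"),
    (4.41, "sh", "__GI___libc_open"),
    (3.45, "sh", "fork"),
    (3.27, "sh", "__mmap"),
    (3.11, "as", "__execve"),
    (2.92, "make", "__GI___vfork"),
    (2.65, "gcc", "__GI___vfork"),
]


def overhead_by(rows, column):
    """Sum overhead percentages grouped by command (column=1) or symbol (column=2)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[column]] += row[0]
    # Largest total first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for symbol, percent in overhead_by(REPORT_ROWS, 2):
    print(f"{percent:6.2f}%  {symbol}")
```

Grouped this way, the execve() rows from sh, gcc, cc1, and as add up to a bit under 31% of the samples, so process startup as a whole outweighs the single largest entry, the 19.43% from getdents() in make.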
The perf tool can take things further; it can, for example, generate call graphs and disassemble the code around specific performance-relevant points. See Ingo's mail and Mel's document for more information. Even then, we're just talking about statistics on tracepoints; there is a lot more information available which can be used in postprocessing scripts or tools like SystemTap. Suffice it to say that tracepoints open a lot of possibilities.
The obvious question is: was Andrew impressed by all this? Here's his answer:
As usual, we're adding tracepoints because we feel we must add tracepoints, not because anyone has a need for the data which they gather.
He suggested that he would be happier if the new tracepoints could be used to phase out /proc/vmstat and /proc/meminfo; that way there would not be a steadily-increasing variety of memory management instrumentation methods. Removing those files is problematic for a couple of reasons, though. One is that they form part of the kernel ABI, which is not easily broken. It would be a multi-year process to move applications over to a different interface and be sure there were no more users of the /proc files. Beyond that, though, tracepoints are good for reporting events, but they are a bit less well-suited to reporting the current state of affairs. One can use a tracepoint to see page allocation events, but an interface like /proc/vmstat can be more straightforward if one simply wishes to know how many pages are free. There is space, in other words, for both styles of instrumentation.
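The difference between the two styles shows up clearly in code. The sketch below, in Python, parses the "name value" lines that /proc/vmstat provides; the sample numbers are made up, and the helper names are hypothetical. Current state (free pages) is a single lookup, while anything event-like has to be derived by subtracting two system-wide snapshots.

```python
# A made-up excerpt in the "name value" layout used by /proc/vmstat.
SNAPSHOT_BEFORE = """\
nr_free_pages 48213
pgalloc_normal 1893321
pgfree 2104455
"""
SNAPSHOT_AFTER = """\
nr_free_pages 47950
pgalloc_normal 1897440
pgfree 2108301
"""


def parse_vmstat(text):
    """Turn 'name value' lines into a dict of integers, skipping odd lines."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[fields[0]] = int(fields[1])
    return stats


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the live file; only meaningful on a Linux system."""
    with open(path) as vmstat:
        return parse_vmstat(vmstat.read())


def counter_delta(before, after, name):
    """Events counted between two snapshots, with no per-process breakdown."""
    return after[name] - before[name]


before = parse_vmstat(SNAPSHOT_BEFORE)
after = parse_vmstat(SNAPSHOT_AFTER)
print("free pages now:", after["nr_free_pages"])
print("allocations in interval:", counter_delta(before, after, "pgalloc_normal"))
```

The first number is exactly what a state-oriented interface is good at. The second is only a system-wide total for the interval; saying which process was responsible is where tracepoints come in.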
As of this writing, nobody has made a final pronouncement on whether the new tracepoints will be merged. Andrew has made it clear, though, that he is not firmly opposing them despite his concerns. There is enough pressure to get better instrumentation into the kernel, and there are enough useful things to do with that instrumentation, that more of it will presumably go into the mainline over time.
