
Kernel analysis with bpftrace

July 18, 2019

This article was contributed by Brendan Gregg

At the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) I gave a keynote on BPF observability that included a kernel issue I had debugged on Netflix production servers using bpftrace. In this article I'll provide a crash course on bpftrace for kernel developers—to help them more easily analyze their code.

I was recently discussing tcp_sendmsg() with another developer who was concerned about large message sizes (such as 100 megabytes) causing failures. 100 MB?? I doubt Netflix is sending messages anywhere near that large in production. Many of you may know the function prototype well; it is (from net/ipv4/tcp.c):

    int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);

bpftrace is already installed on Netflix (and other companies) production servers, so I ssh-ed to a busy server to check the size distribution of messages for ten seconds:

    # bpftrace -e 'k:tcp_sendmsg { @size = hist(arg2); } interval:s:10 { exit(); }'
    Attaching 2 probes...
    
    @size: 
    [4, 8)                25 |                                                    |
    [8, 16)               74 |                                                    |
    [16, 32)               5 |                                                    |
    [32, 64)           13603 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [64, 128)           2527 |@@@@@@@@@                                           |
    [128, 256)            21 |                                                    |
    [256, 512)           181 |                                                    |
    [512, 1K)           1587 |@@@@@@                                              |
    [1K, 2K)            2992 |@@@@@@@@@@@                                         |
    [2K, 4K)            9367 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
    [4K, 8K)           12461 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
    [8K, 16K)            995 |@@@                                                 |
    [16K, 32K)          1977 |@@@@@@@                                             |
    [32K, 64K)           428 |@                                                   |
    [64K, 128K)           14 |                                                    |
    [128K, 256K)           2 |                                                    |

The largest sizes fall in the 128KB to 256KB range. This is using a kprobe ("k") on tcp_sendmsg() and saving a histogram of arg2 (size) to a BPF map named "@size" (the name is unimportant; I'm just using it to annotate the output). An interval probe fires after ten seconds and calls exit(), at which point all BPF maps are printed.

What about errors?

    # bpftrace -e 'kr:tcp_sendmsg { @retvals[retval > 0 ? 0 : retval] = count(); }
        interval:s:10 { exit(); }'
    Attaching 2 probes...
    
    @retvals[0]: 49794

No errors. This time I am using a kretprobe ("kr") and frequency-counting retval, which is either a negative error code or the size sent. Since I don't care about the size, I used a ternary operator to map all positive values to zero.

bpftrace lets you answer these questions immediately, in production, and without needing debuginfo (Netflix does not typically have it installed). For this case, it showed that Netflix was unlikely to see the large tcp_sendmsg() problem for this workload.

Other one-liners

The previous tcp_sendmsg() one-liners demonstrate some of bpftrace's kernel-analysis capabilities. Here are several more one-liners, again using tcp_sendmsg() as the example; you can imagine switching it to other kernel functions.

Show tcp_sendmsg() sizes bigger than 8192 bytes, per-event:

    bpftrace -e 'k:tcp_sendmsg /arg2 > 8192/ { printf("PID %d: %d bytes\n", pid, arg2); }'

Show a histogram of request size for each process (PID and comm):

    bpftrace -e 'k:tcp_sendmsg { @size[pid, comm] = hist(arg2); }'

Frequency count return values:

    bpftrace -e 'kr:tcp_sendmsg { @return[retval] = count(); }'

Show per-second statistics: event count, average size, and total bytes:

    bpftrace -e 'k:tcp_sendmsg { @size = stats(arg2); }
        interval:s:1 { print(@size); clear(@size); }'

Count calling stack traces:

    bpftrace -e 'k:tcp_sendmsg { @[kstack] = count(); }'

Count calling stack traces, three levels deep:

    bpftrace -e 'k:tcp_sendmsg { @[kstack(3)] = count(); }'

Show function latency as a histogram, in nanoseconds:

    bpftrace -e 'k:tcp_sendmsg { @ts[tid] = nsecs; }
        kr:tcp_sendmsg /@ts[tid]/ { @ns = hist(nsecs - @ts[tid]); delete(@ts[tid]); }'

This last example is saving a timestamp in one probe (keyed on thread ID) and retrieving it in another. This pattern can be used for custom latency measurements.
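The pattern is not limited to the entry and return of a single function: any two probes that share a key can be timed this way. As a sketch of my own (kprobe names vary by kernel version; this is roughly the approach taken by the biolatency.bt tool in the bpftrace repository), block I/O latency can be measured by keying on the request pointer instead of the thread ID:

    bpftrace -e 'k:blk_account_io_start { @start[arg0] = nsecs; }
        k:blk_account_io_done /@start[arg0]/ {
            @usecs = hist((nsecs - @start[arg0]) / 1000); delete(@start[arg0]); }'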

struct data

There is an important capability missing from those one-liners: struct navigation. Here is the function prototype again:

    int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);

bpftrace provides arg0-argN for kprobe function arguments, simply mapping them to the registers for the calling convention (arg2 becomes %rdx on x86_64, for example). Since bpftrace can read kernel headers, which are often installed on production systems, accessing struct data is possible by including the right header and casting the arguments:

    #include <net/sock.h>
    [...]
            $sk = (struct sock *)arg0;

Here's a bpftrace tool that prints the address information, size, and return value from tcp_sendmsg(). Example output:

    # ./tcp_sendmsg.bt 
    Attaching 2 probes...
    10.0.0.65       49978 -> 52.37.243.173   443  : 63 bytes, retval 63
    127.0.0.1       58566 -> 127.0.0.1       22   : 36 bytes, retval 36
    127.0.0.1       22    -> 127.0.0.1       58566: 36 bytes, retval 36
    [...]

The source of tcp_sendmsg.bt:

    #!/usr/local/bin/bpftrace
    
    #include <net/sock.h>
    
    k:tcp_sendmsg
    {
	    @sk[tid] = arg0;
	    @size[tid] = arg2;
    }
    
    kr:tcp_sendmsg
    /@sk[tid]/
    {
	    $sk = (struct sock *)@sk[tid];
	    $size = @size[tid];
	    $af = $sk->__sk_common.skc_family;
	    if ($af == AF_INET) {
		    $daddr = ntop($af, $sk->__sk_common.skc_daddr);
		    $saddr = ntop($af, $sk->__sk_common.skc_rcv_saddr);
		    $lport = $sk->__sk_common.skc_num;
		    $dport = $sk->__sk_common.skc_dport;
		    $dport = ($dport >> 8) | (($dport << 8) & 0xff00);
		    printf("%-15s %-5d -> %-15s %-5d: %d bytes, retval %d\n",
		        $saddr, $lport, $daddr, $dport, $size, retval);
	    } else {
		    printf("IPv6...\n");
	    }
	    delete(@sk[tid]);
	    delete(@size[tid]);
    }

In the kprobe, sk and size are saved in maps keyed on thread ID, so they can be retrieved in the kretprobe when tcp_sendmsg() returns. If it is an IPv4 message, the kretprobe casts sk and prints out the details, using the bpftrace ntop() function to convert the addresses to strings. The destination port is flipped from network to host byte order. To keep this short I skipped IPv6, but you can add code to handle it too (ntop() does support IPv6 addresses); a sketch follows below.
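For the curious, here is a minimal sketch of what that IPv6 branch could look like, assuming the IPv6 addresses are reachable via the struct sock_common fields used by other bpftrace tools (skc_v6_daddr and skc_v6_rcv_saddr; check your kernel headers). It replaces the else branch in the tool above:

	    } else {
		    // IPv6: the addresses also live in sock_common;
		    // ntop() infers IPv6 from the 16-byte address
		    $daddr = ntop($sk->__sk_common.skc_v6_daddr.in6_u.u6_addr8);
		    $saddr = ntop($sk->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr8);
		    $lport = $sk->__sk_common.skc_num;
		    $dport = $sk->__sk_common.skc_dport;
		    $dport = ($dport >> 8) | (($dport << 8) & 0xff00);
		    printf("%-15s %-5d -> %-15s %-5d: %d bytes, retval %d\n",
		        $saddr, $lport, $daddr, $dport, $size, retval);
	    }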

There is work underway for bpftrace to use BPF Type Format (BTF) information as well, which brings various advantages including struct definitions that are missing from kernel headers.

Advanced example

So far I've shown straightforward tracing capabilities. For an advanced example, I'll demonstrate off-CPU profiling.

CPU profiling of a task is typically easy: I can sample stacks and inspect performance-monitoring counters (PMCs) and model-specific registers (MSRs) to see what was running on-CPU and why it was slow. Off-CPU profiling, on the other hand, has been problematic: I can show the blocking stacks on context switches, but they often show only that the task is blocked on something else (select(), epoll(), or a lock). I need to know what that something else is.

BPF has at long last provided a solution to this problem: the ability to save stack traces and retrieve them later (something I've always wanted DTrace to do, but it cannot). Here is a bpftrace version of the solution, which shows the process name, blocking stack, waker stack, and a histogram of the blocked time in microseconds:

   # ./offwake.bt
   Attaching 4 probes...
   Tracing off-CPU time (us) with waker stacks. Ctrl-C to end.
   ^C
   [...]
   
   @us[ssh,
       finish_task_switch+1
       schedule+44
       schedule_hrtimeout_range_clock+185
       schedule_hrtimeout_range+19
       poll_schedule_timeout+69
       do_select+1378
       core_sys_select+471
       sys_select+183
       do_syscall_64+115
       entry_SYSCALL_64_after_hwframe+61
   ,
       try_to_wake_up+1
       pollwake+115
       __wake_up_common+115
       __wake_up_common_lock+128
       __wake_up_sync_key+30
       sock_def_readable+64
       tcp_rcv_established+1281
       tcp_v4_do_rcv+144
       tcp_v4_rcv+2423
       ip_local_deliver_finish+98
       ip_local_deliver+111
       ip_rcv_finish+297
       ip_rcv+655
       __netif_receive_skb_core+1074
       __netif_receive_skb+24
       netif_receive_skb_internal+69
       napi_gro_receive+197
       ieee80211_deliver_skb+200
       ieee80211_rx_handlers+5376
       ieee80211_prepare_and_rx_handle+1865
       ieee80211_rx_napi+914
       iwl_mvm_rx_rx_mpdu+1205
       iwl_mvm_rx+77
       iwl_pcie_rx_handle+571
       iwl_pcie_irq_handler+1577
       irq_thread_fn+38
       irq_thread+325
       kthread+289
       ret_from_fork+53
   ]:
   [64, 128)              1 |@@                                                  |
   [128, 256)             1 |@@                                                  |
   [256, 512)             1 |@@                                                  |
   [512, 1K)              1 |@@                                                  |
   [1K, 2K)               4 |@@@@@@@@@@                                          |
   [2K, 4K)              10 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
   [4K, 8K)              18 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |
   [8K, 16K)              3 |@@@@@@@@                                            |
   [16K, 32K)            16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
   [32K, 64K)            19 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [64K, 128K)            3 |@@@@@@@@                                            |

I've included one stack pair in this output. It shows an ssh process with a blocking stack in select(). The waker stack reveals what it was waiting for: a network packet. The histogram shows the durations for this off-CPU/waker path, in microseconds.

The source to offwake.bt is:

   #!/usr/local/bin/bpftrace
   
   #include <linux/sched.h>
   
   BEGIN
   {
	   printf("Tracing off-CPU time (us) with waker stacks. Ctrl-C to end.\n");
   }
   
   kprobe:try_to_wake_up
   {
	   $p = (struct task_struct *)arg0;
	   @waker[$p->pid] = kstack;
   }
   
   kprobe:finish_task_switch
   {
	   $prev = (struct task_struct *)arg0;
   
	   // record timestamp of sleeping task:
	   @ts[$prev->pid] = nsecs;
   
	   if (@ts[tid]) {
		   $offcpu_us = (nsecs - @ts[tid]) / 1000;
		   @us[comm, kstack, @waker[tid]] = hist($offcpu_us);
		   delete(@ts[tid]);
		   delete(@waker[tid]);
	   }
   }
   
   END
   {
	   clear(@waker);
	   clear(@ts);
   }

The kernel stack on try_to_wake_up() is saved with the task ID, which is later retrieved on finish_task_switch(). This is a simple bpftrace version of the offwaketime tool from BCC, and the samples/bpf/offwaketime* examples in the kernel source.

I explained this problem and the BPF solution in my Performance@Scale talk (slides [Slideshare], video), showing how these stack traces can be visualized as flame graphs. Sometimes you need to show who woke the waker, and who woke them, and so on. By walking the wakeup chain, I can construct a "chain graph" to show the original source of the latency (usually interrupts).

Tracepoints

The previous examples used kprobes, which will need updating as the kernel changes. Tracepoints are much preferable to kprobes for bpftrace programs, even though their maintenance is "best effort" and they might change. That's still better than kprobes, which can change at any time, or go missing entirely because the function was inlined. However, some kprobes are more stable than others, particularly those on internal kernel interfaces such as the VFS, struct file_operations, and struct proto.
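To see which tracepoints exist in the first place, bpftrace can list and wildcard-match them (a small example of my own; the available tracepoints depend on the kernel build):

   # bpftrace -l 't:timer:*'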

Just as a simple example, here are the arguments provided by the timer:hrtimer_start tracepoint:

   # bpftrace -lv 't:timer:hrtimer_start'
   tracepoint:timer:hrtimer_start
       void * hrtimer;
       void * function;
       s64 expires;
       s64 softexpires;

Frequency counting the function argument:

   # bpftrace -e 't:timer:hrtimer_start { @[ksym(args->function)] = count(); }'
   Attaching 1 probe...
   ^C
   
   @[sched_rt_period_timer]: 6
   @[watchdog_timer_fn]: 16
   @[intel_uncore_fw_release_timer]: 290
   @[it_real_fn]: 376
   @[hrtimer_wakeup]: 12301
   @[tick_sched_timer]: 36433

The tracepoint arguments are accessible via args. In this case, I used the ksym() bpftrace function to return the kernel symbol name for the address.
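As a small variation of my own (not shown in the output above), tracepoint arguments combine with other builtins in the usual way; for example, also counting which processes are starting those timers:

   # bpftrace -e 't:timer:hrtimer_start { @[comm, ksym(args->function)] = count(); }'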

More examples and information

I stepped through a Netflix production example in my LSFMM keynote (slides [Slideshare]). More example programs can be found in the bpftrace repository under tools. At LSFMM, I previewed more tools I had developed for an upcoming BPF analysis book for Addison-Wesley; they will be made available on the BPF book page.

To try bpftrace yourself, see the INSTALL guide and get the latest version: 0.9.1 or newer. There are packages for various distributions; companies including Netflix and Facebook have their own internal packages; or you can build from source. bpftrace currently depends on LLVM and Clang (as does BCC), although a future version may make them optional.
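As a hedged example of the packaged route (package names and availability vary by distribution and release), on recent Debian- and Ubuntu-based systems it is typically:

    # apt-get install bpftrace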

Also see the bpftrace Cheat Sheet for a summary of the language as well as the full bpftrace Reference Guide.

What about perf and ftrace?

I use the best tool for the job, and many times that's not bpftrace: it is perf or ftrace. Examples:

  • Frequency counting many functions (e.g., tcp_*): ftrace, because it's fast to initialize. BPF kprobe attaching will become faster with a multi-attach perf_event_open() once it is developed.
  • Function flow tracing: ftrace function graphing.
  • PMC statistics: perf.
  • CPU sampling with timestamps: perf, as its perf.data output has been optimized.

For the last one, sometimes I need timestamps on samples, but often I don't and can just use bpftrace; for example, sampling kernel stacks at 99 Hertz:

    # bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'
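For comparison, when I do want the timestamps, the perf counterpart is the familiar invocation (standard perf usage, not something specific to this article):

    # perf record -F 99 -a -g -- sleep 10
    # perf script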

Conclusion

Times have changed for Linux. It now has an advanced tracer, bpftrace, built from the ground up for extended BPF and Linux, and it is solving real production problems at Netflix and other companies. With simple one-liners or short tools you can inspect your own code in custom ways. This article showed lots of examples.

If you'd like to see how your code is running in Netflix production, email me your bpftrace program and I can reply with the output (so long as it doesn't reveal company or customer information). If it helps you discover some inefficiencies in how Linux is executing our workload, and you develop a fix, that's great for Netflix. I am bgregg@netflix.com.

[Thanks to Alastair Robertson for creating bpftrace, and the bpftrace, BCC, and BPF communities for all the work over the past five years.]


Index entries for this article
Kernel: BPF/Tracing
Kernel: Tracing/with BPF
GuestArticles: Gregg, Brendan



Kernel analysis with bpftrace

Posted Jul 18, 2019 19:04 UTC (Thu) by SPYFF (subscriber, #131114) [Link]

Really good introduction to bpftrace. Finally a good alternative to perf probe creation followed by perf trace. I would really like to see something BPF-based for function flow tracing/graphing; that would be a true killer!

Kernel analysis with bpftrace

Posted Jul 18, 2019 19:14 UTC (Thu) by dw (subscriber, #12017) [Link] (7 responses)

Very excited about BTF arriving, but I'm a little confused. I expected this would mean every typical kernel would ship with type info, but I read a few things in the past week suggesting this isn't quite how distros are actually going to treat it? (Sorry, lost the links.) The advantage of BTF for me is not having to mess with essentially setting up a 'development setup' on any random machine that might need to be touched, but separate dbginfo-type package installation is exactly that.

Kernel analysis with bpftrace

Posted Jul 18, 2019 20:16 UTC (Thu) by nix (subscriber, #2304) [Link] (5 responses)

As a side note -- tangential but I can't resist -- CTF support has already landed in binutils (only the library at first, but the linker portion is under review now and does work in my testing). This should let us have CTF type info in the kernel as well, built from the toolchain-generated CTF info :) and it's not stripped by strip(1), either, so the thing is always present in the binaries. (I will eventually teach it to handle FreeBSD etc CTF info as well -- but BTF is off the table, it's just too different.)

(Obviously, it's non-loaded info, but it's crazy to parse this sort of stuff in kernel space anyway. This is for userspace tools to use when analysing other things... like, say, the kernel. The kernel is actually a special case: because of network booting, boots from /boot that is then unmounted etc it is quite common to boot from a kernel you can't actually access, and the kernel has *vast* amounts of type info in it, far too much to sanely load all of it at once. So we'll keep the kernel's type info in a separate, compressed, loadable archive. It's only every other binary that gets its type info in a built-in .ctf section.)

Kernel analysis with bpftrace

Posted Jul 18, 2019 20:23 UTC (Thu) by SEJeff (guest, #51588) [Link] (1 responses)

Is there a git repo you have so we can follow along by chance?

CTF

Posted Jul 19, 2019 21:03 UTC (Fri) by mjw (subscriber, #16740) [Link]

It is already in the upstream binutils git repo:
https://sourceware.org/git/?p=binutils-gdb.git;a=history;...

With additional work being discussed on the binutils mailinglist:
http://sourceware.org/ml/binutils/current

Support for the linker posted here:
https://sourceware.org/ml/binutils/2019-07/msg00159.html

The GCC support just saw its V4 RFC:
https://gcc.gnu.org/ml/gcc-patches/2019-07/msg01209.html

Kernel analysis with bpftrace

Posted Jul 19, 2019 13:57 UTC (Fri) by nivedita76 (guest, #121790) [Link]

is there a doc somewhere? What is the use case for type info for binaries without debug info?

Kernel analysis with bpftrace

Posted Jul 19, 2019 15:25 UTC (Fri) by bgregg (subscriber, #46639) [Link] (1 responses)

Who are still the end users of CTF? I wanted it years ago, but now we have BTF.

Kernel analysis with bpftrace

Posted Jul 19, 2019 15:43 UTC (Fri) by dw (subscriber, #12017) [Link]

FreeBSD is a heavy user

Kernel analysis with bpftrace

Posted Jul 18, 2019 20:58 UTC (Thu) by bgregg (subscriber, #46639) [Link]

Don't overlook how much can be done from linux headers, without BTF. Out of the 100+ new tools I developed in the BPF book, only three of them hit issues with missing structs where I needed to manually declare them. Although, that ratio will get higher (worse) the deeper you dig into things.

Facebook have already added BTF to the build process as CONFIG_DEBUG_INFO_BTF, and I've heard that it can be included in vmlinux, so the kernel is self-describing. We will need to convince distros to enable this, and whatever options are needed to deliver BTF-embedded vmlinux binaries by default (I can ask Canonical for Ubuntu). The BTF dedup work keeps it small, so adding it by default should be acceptable.

I don't think there is a howto about building vmlinux with BTF yet. Arnaldo did talk about the recent BTF work [0], there's an older post from Andrii about it [1], and there's the BTF documentation in the kernel [2].

[0] http://vger.kernel.org/~acme/perf/btf-perf-pahole-lsfmm-s...
[1] https://facebookmicrosites.github.io/bpf/blog/2018/11/14/...
[2] https://www.kernel.org/doc/html/latest/bpf/btf.html#btf-g...

Kernel analysis with bpftrace

Posted Jul 18, 2019 23:14 UTC (Thu) by unixbhaskar (guest, #44758) [Link]

Wonderful! Brendan ...enjoyed very much!

Kernel analysis with bpftrace

Posted Jul 18, 2019 23:17 UTC (Thu) by dgc (subscriber, #6611) [Link] (4 responses)

Brendan, running on Debian and installing the distro bpftrace package, stuff like your offwake.bt script doesn't work.

The problem is that your script does this:

#include <linux/sched.h>

and it assumes that it picks up a kernel-internal header file, not the uapi/linux/sched.h that the distro has installed in /usr/include/linux/. Hence:

$ sudo ./offwake.bt
Unknown struct/union: 'task_struct'
Unknown struct/union: 'task_struct'
$

And pointing it at the installed kernel header package via:

$ sudo ./offwake.bt -I /usr/src/linux-headers-5.0.0-trunk-common/include -I /usr/src/linux-headers-5.0.0-trunk-common/arch/x86/include/
/usr/src/linux-headers-5.0.0-trunk-common/arch/x86/include/asm/bitops.h:134:2: warning: implicit declaration of function 'barrier' is invalid in C99 [-Wimplicit-function-declaration]
.....
fatal error: too many errors emitted, stopping now [-ferror-limit=]

Throws so many warnings/errors that it aborts.

So there's some magic in your setup that is not apparent from either the article, your code samples or the documentation. Any ideas?

-Dave.

Kernel analysis with bpftrace

Posted Jul 18, 2019 23:44 UTC (Thu) by bgregg (subscriber, #46639) [Link] (3 responses)

I'd raise this as a bpftrace issue please[0]; my guess is you've got an older version of LLVM, but the Debian package description does say llvm-7, which should be fine.

Using bpftrace to check how bpftrace is running that tool on my system:

# BPFTRACE_STRLEN=128 bpftrace -e 't:syscalls:sys_enter_openat /comm == "bpftrace"/ { printf("opening: %s\n", str(args->filename)); }' | grep sched
opening: /lib/modules/4.15.0-54-generic/build/include/linux/sched.h
opening: /lib/modules/4.15.0-54-generic/build/include/uapi/linux/sched.h
opening: /lib/modules/4.15.0-54-generic/build/include/asm-generic/bitops/sched.h
opening: /lib/modules/4.15.0-54-generic/build/include/linux/sched/prio.h

(We're switching from BPF stack to map storage for strings, so needing to increase BPFTRACE_STRLEN should become a thing of the past.)

[0] https://github.com/iovisor/bpftrace/issues

Kernel analysis with bpftrace

Posted Jul 19, 2019 0:44 UTC (Fri) by dgc (subscriber, #6611) [Link] (2 responses)

Ok, so there's no /lib/modules/<vers>/build directories on any of my machines. And my test VMs don't even run a modular kernel (it's supplied externally by qemu invocation), so I'm guessing this is the issue. I'll have a bit of a look around to see if I can get this populated and whether that makes the problem go away. If I can't solve it easily, then I'll raise an issue for it.

Thanks!

-Dave.

Kernel analysis with bpftrace

Posted Jul 26, 2019 16:59 UTC (Fri) by BenHutchings (subscriber, #37955) [Link]

If you build the kernel with "make bindeb-pkg" you'll get those kernel headers packaged up in a linux-headers-<release> package, same as for a distribution kernel.

Kernel analysis with bpftrace

Posted Jul 27, 2019 5:56 UTC (Sat) by flussence (guest, #85566) [Link]

I was wondering what that new CONFIG_IKHEADERS in kernel 5.2 could possibly be useful for; this scenario (kernel image living outside the known filesystem) sounds like exactly it.

Kernel analysis with bpftrace

Posted Jul 19, 2019 9:13 UTC (Fri) by nilsmeyer (guest, #122604) [Link]

That still doesn't make up for Netflix cancelling shows I like.

Kernel analysis with bpftrace

Posted Mar 21, 2022 13:25 UTC (Mon) by ideal (guest, #157529) [Link]

# bpftrace -e 't:timer:hrtimer_start { @[ksym(args->function)] = count(); }'

On kernel v5.16.x, you need to include <linux/hrtimer.h>, or else enum hrtimer_mode is not defined.


Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds