Some advanced BCC topics

February 22, 2018

This article was contributed by Matt Fleming

The BPF virtual machine is working its way into an increasing number of kernel subsystems. The previous article in this series introduced the BPF Compiler Collection (BCC), which provides a set of tools for working with BPF. But there is more to BCC than a set of administrative tools; it also provides a development environment for those wanting to create their own BPF-based utilities. Read on for an exploration of that environment and how it can be used to create programs and attach them to tracepoints.

The BCC runtime provides a macro, TRACEPOINT_PROBE, that declares a function and attaches it to a tracepoint, so that the function is called every time the tracepoint fires. The following snippet of C code shows an empty BPF program that runs every time kmalloc() is called in the kernel:

    TRACEPOINT_PROBE(kmem, kmalloc) {
        return 0;
    }

The arguments to this macro are the category of the tracepoint and the event itself; this translates directly into the debugfs file system hierarchy layout (e.g. /sys/kernel/debug/tracing/events/category/event/). In true BCC-make-things-simple fashion, the tracepoint is automatically enabled when the BPF program is loaded.
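The mapping from macro arguments to the tracing directory can be sketched in a few lines of Python. The helper name tracepoint_dir() is invented for this illustration and is not part of BCC; the sketch also assumes that debugfs is mounted at the usual /sys/kernel/debug location.

```python
import os

# Assumed default debugfs mount point; adjust if debugfs lives elsewhere.
EVENTS_ROOT = "/sys/kernel/debug/tracing/events"

def tracepoint_dir(category, event):
    """Return the events directory matching TRACEPOINT_PROBE(category, event)."""
    return "%s/%s/%s" % (EVENTS_ROOT, category, event)

if __name__ == "__main__":
    path = tracepoint_dir("kmem", "kmalloc")
    print(path)
    # Reading the directory usually requires root privileges.
    if os.path.isdir(path) and os.access(path, os.R_OK):
        print(sorted(os.listdir(path)))
```

On a system where the directory is readable, the listing should include the format file discussed next.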

The kmalloc() tracepoint is passed a number of arguments, which are described in the associated format file. Tracepoint arguments are accessible in BPF programs through the magic args variable. For our example, we care about args->call_site, which is the kernel instruction address of the kmalloc() call. To keep a count of the different kernel functions that call kmalloc(), we can store a counter in a hash table and use the call-site address as an index.
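To see which members args is likely to offer, one can read the field declarations out of the format file. The sketch below parses an abbreviated, illustrative sample rather than a real file; the function name tracepoint_fields() and the sample text are assumptions made for this example, and the real file for kmem/kmalloc has more fields and may differ between kernels.

```python
import re

# Abbreviated, illustrative sample of a tracepoint format file.
SAMPLE_FORMAT = """\
name: kmalloc
format:
    field:unsigned short common_type; offset:0; size:2; signed:0;
    field:int common_pid; offset:4; size:4; signed:1;

    field:unsigned long call_site; offset:8; size:8; signed:0;
    field:const void * ptr; offset:16; size:8; signed:0;
    field:size_t bytes_req; offset:24; size:8; signed:0;
"""

FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*size:(?P<size>\d+);")

def tracepoint_fields(format_text):
    """Return (name, offset, size) for each event-specific field."""
    fields = []
    for m in FIELD_RE.finditer(format_text):
        # The field name is the last token of the C declaration.
        name = m.group("decl").split()[-1]
        if name.startswith("common_"):
            continue  # header fields shared by every tracepoint
        fields.append((name, int(m.group("offset")), int(m.group("size"))))
    return fields

for name, offset, size in tracepoint_fields(SAMPLE_FORMAT):
    print("args->%s (offset %d, %d bytes)" % (name, offset, size))
```

Array fields such as "char comm[16]" would keep their brackets in the name with this simple parser; a fuller version would strip them.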

While BCC provides access to the full range of data structures exported by the kernel (and covered in the first article of this series), the two most frequently used are BPF_HASH and BPF_TABLE. Fundamentally, all of BCC's data structures are maps, and higher-level data structures are built on top of them; the most basic of these is BPF_TABLE. The BPF_TABLE macro takes a type of table (hash, percpu_array, or array) as an argument, and other macros, such as BPF_HASH and BPF_ARRAY, are simply wrappers around BPF_TABLE. Because all data structures are maps, they all support the same core set of functions, including map.lookup(), map.update(), and map.delete(). (There are also some map-specific functions such as map.perf_read() for BPF_PERF_ARRAY and map.call() for BPF_PROG_ARRAY.)

Returning to our example program, we can store the kernel instruction-pointer address of the kmalloc() call-site (and the number of times it was called) using a BPF_HASH map and post-process it with Python. Here is the entire script, including the BPF program.

    #!/usr/bin/env python

    from bcc import BPF
    from time import sleep

    program = """
        BPF_HASH(callers, u64, unsigned long);

        TRACEPOINT_PROBE(kmem, kmalloc) {
            u64 ip = args->call_site;
            unsigned long *count;
            unsigned long c = 1;

            count = callers.lookup((u64 *)&ip);
            if (count != 0)
                c += *count;

            callers.update(&ip, &c);

            return 0;
        }
    """
    b = BPF(text=program)

    while True:
        try:
            sleep(1)
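            # Sort by call-site address; ctypes objects cannot be ordered
            # directly under Python 3, so compare their integer values.
            for k, v in sorted(b["callers"].items(), key=lambda kv: kv[0].value):
                print("%s %u" % (b.ksym(k.value), v.value))
            print("")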
        except KeyboardInterrupt:
            exit()

The syntax for the BPF_HASH macro is described in the BCC reference guide. The macro takes a number of optional arguments, but for most uses all you need to specify is the name of this hash table instance (callers in this example), the key data type (u64), and the value data type (unsigned long). BPF hash table entries are accessed using the lookup() function; if no entry exists for a given key, NULL is returned. update() will either insert a new key-value pair (if none exists) or update the value of an existing key. Thus, the BPF code for working with hashes can be quite compact because you can use a single function (update()) regardless of whether you're inserting a new item or updating an existing one.
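The lookup-then-update pattern can be tried out without loading anything into the kernel. The following is a pure-Python model of the semantics described above, with None standing in for NULL; the ToyHash class and count_call() function are invented for this sketch and are not part of BCC.

```python
class ToyHash(object):
    """Pure-Python stand-in for a BPF hash map (illustration only)."""

    def __init__(self):
        self._data = {}

    def lookup(self, key):
        # Returns None where the BPF version would return NULL.
        return self._data.get(key)

    def update(self, key, value):
        # Inserts a new pair or overwrites the existing value.
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


def count_call(callers, ip):
    """Mirror the body of the TRACEPOINT_PROBE in the script above."""
    c = 1
    count = callers.lookup(ip)
    if count is not None:
        c += count
    callers.update(ip, c)


callers = ToyHash()
for ip in (0xabc, 0xabc, 0xdef, 0xabc):
    count_call(callers, ip)
print(callers.lookup(0xabc), callers.lookup(0xdef), callers.lookup(0x123))
```

Because update() covers both the insert and the overwrite case, count_call() needs no separate branch for the first call from a given address.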

Once a count has been stored in the hash table, it can be processed with Python. Accessing the table is done by indexing the BPF object (called b in the example). The resultant Python object is a HashTable (defined in the BCC Python front end) and its items are accessed using the items() function. Note that Python BCC maps provide a different set of functions than BPF maps.

items() returns key-value pairs of Python ctypes objects, whose underlying integers can be retrieved using the value member. For example, the following code from the example above iterates over all items in the callers hash table and prints the kernel functions (using the BCC BPF.ksym() helper function to convert kernel addresses to symbols) that invoked kmalloc() and the number of calls:

    for k, v in sorted(b["callers"].items(), key=lambda kv: kv[0].value):
        print("%s %u" % (b.ksym(k.value), v.value))
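Since the keys and values are ordinary ctypes objects, any further post-processing is plain Python. As a sketch, the hypothetical helper below (top_callers() is not a BCC function) orders entries by descending count; it is exercised here with made-up stand-in data shaped like the result of b["callers"].items().

```python
import ctypes as ct

def top_callers(items, limit=3):
    """Return (address, count) tuples ordered by descending count."""
    pairs = [(k.value, v.value) for k, v in items]
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)[:limit]

# Stand-in data; the addresses are invented for the example.
fake_items = [
    (ct.c_ulong(0x1000), ct.c_ulong(4)),
    (ct.c_ulong(0x2000), ct.c_ulong(504)),
    (ct.c_ulong(0x3000), ct.c_ulong(22)),
]

for addr, count in top_callers(fake_items, limit=2):
    print("%#x %u" % (addr, count))
```

In a real script, the addresses would be passed through b.ksym() before printing, as in the loop above.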

The output from this little program looks like:

    # ./example.py
    i915_sw_fence_await_dma_fence 4
    intel_crtc_duplicate_state 4
    SyS_memfd_create 1
    drm_atomic_state_init 4
    sg_kmalloc 7
    intel_atomic_state_alloc 4
    seq_open 504
    SyS_bpf 22

Though this example is relatively straightforward, larger tools will not be, and developers need ways to debug more complex tools. Thankfully, there are a few ways that BCC helps simplify the debugging process.

Controlling BPF program compilation and loading

Whenever a Python BPF object is instantiated, the BPF program source code contained within it is automatically compiled and loaded into the kernel. The compilation process can be controlled by passing compiler flag arguments in the cflags parameter to the BPF class constructor. These flags are passed directly to the Clang compiler, so any options that you might normally pass to the compiler can be used; all compiler warnings can be turned on with "cflags=['-Wall']", for instance.

A popular use of cflags in the official BCC tools is to pass macro definitions. For example, the xdp_drop_count.py script statically allocates an array with enough space for every online CPU using Python's multiprocessing library and Clang's -D flag:

    cflags=["-DNUM_CPUS=%d" % multiprocessing.cpu_count()])
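When several macros are involved, it can be convenient to generate the flag list from a dictionary. The build_cflags() helper below is a hypothetical convenience function, not something BCC provides; it only assembles the strings that would be handed to the cflags parameter.

```python
import multiprocessing

def build_cflags(macros, warnings=True):
    """Turn a dict of macro names and values into Clang command-line flags."""
    flags = ["-Wall"] if warnings else []
    for name in sorted(macros):
        flags.append("-D%s=%s" % (name, macros[name]))
    return flags

cflags = build_cflags({"NUM_CPUS": multiprocessing.cpu_count()})
print(cflags)
# With BCC installed, the list would be passed along as:
#     BPF(text=program, cflags=cflags)
```

Sorting the macro names keeps the generated command line stable from run to run, which makes compiler output easier to compare.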

The BPF class constructor also accepts a number of debugging flags in the debug argument. Each of these flags individually enables extra logging during either the compilation or the loading process. For example, the DEBUG_BPF flag causes the BPF bytecode to be output, which can be a last hope for those really troublesome bugs. This output looks like:

    0: (79) r1 = *(u64 *)(r1 +8)
    1: (7b) *(u64 *)(r10 -8) = r1
    2: (b7) r1 = 1
    3: (7b) *(u64 *)(r10 -16) = r1
    4: (18) r1 = 0xffff8801a6098a00
    6: (bf) r2 = r10
    7: (07) r2 += -8
    8: (85) call bpf_map_lookup_elem#1
    9: (15) if r0 == 0x0 goto pc+3
     R0=map_value(id=0,off=0,ks=8,vs=8,imm=0) R10=fp0
    10: (79) r1 = *(u64 *)(r0 +0)
     R0=map_value(id=0,off=0,ks=8,vs=8,imm=0) R10=fp0
    11: (07) r1 += 1
    12: (7b) *(u64 *)(r10 -16) = r1
    13: (18) r1 = 0xffff8801a6098a00
    15: (bf) r2 = r10
    16: (07) r2 += -8
    17: (bf) r3 = r10
    18: (07) r3 += -16
    19: (b7) r4 = 0
    20: (85) call bpf_map_update_elem#2
    21: (b7) r0 = 0
    22: (95) exit
    
    from 9 to 13: safe
    processed 22 insns, stack depth 16

This output comes directly from the in-kernel verifier and shows every instruction of bytecode emitted by Clang/LLVM, along with the register state on branch instructions. If this level of detail still isn't enough, the DEBUG_BPF_REGISTER_STATE flag generates even more verbose log messages.
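A log like this one is regular enough to be summarized with a short script. The sketch below, whose summarize() function is made up for this example, counts the instruction lines in an excerpt of the log shown above and lists the BPF helper functions that the program calls; it assumes the line layout seen in that output and may need adjusting for other kernel versions.

```python
import re

# Excerpt of the verifier log shown above.
LOG = """\
8: (85) call bpf_map_lookup_elem#1
9: (15) if r0 == 0x0 goto pc+3
 R0=map_value(id=0,off=0,ks=8,vs=8,imm=0) R10=fp0
10: (79) r1 = *(u64 *)(r0 +0)
20: (85) call bpf_map_update_elem#2
21: (b7) r0 = 0
22: (95) exit
processed 22 insns, stack depth 16
"""

# "<index>: (<opcode>) <disassembly>"
INSN_RE = re.compile(r"^\s*(\d+): \(([0-9a-f]{2})\) (.*)$")
CALL_RE = re.compile(r"^call (\w+)#(\d+)$")

def summarize(log):
    """Return (number of instruction lines, list of helper names called)."""
    insns, helpers = 0, []
    for line in log.splitlines():
        m = INSN_RE.match(line)
        if not m:
            continue  # register-state and summary lines
        insns += 1
        call = CALL_RE.match(m.group(3))
        if call:
            helpers.append(call.group(1))
    return insns, helpers

print(summarize(LOG))
```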

For run-time debugging, bpf_trace_printk() provides a printk()-style interface for writing to /sys/kernel/debug/tracing/trace_pipe from BPF programs; those messages can then be consumed and printed in Python using the BPF.trace_print() function.
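As a minimal sketch of that interface, the script below prints the call-site address from the kmalloc() tracepoint used earlier. It assumes that BCC is installed and that the script is run as root; the run() wrapper is only there so that the BCC import happens when the function is called.

```python
# A raw string keeps "\n" intact for the C compiler.
PROGRAM = r"""
TRACEPOINT_PROBE(kmem, kmalloc) {
    bpf_trace_printk("kmalloc from %lx\n", args->call_site);
    return 0;
}
"""

def run():
    # Imported lazily so that the file loads on machines without BCC.
    from bcc import BPF
    b = BPF(text=PROGRAM)
    # Blocks, echoing lines from trace_pipe until interrupted.
    b.trace_print()
```

Calling run() starts the trace; since kmalloc() is called very frequently, expect a lot of output.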

However, a major drawback of this approach is that, since the trace_pipe file is a global resource, it contains all messages written by concurrent writers, making it difficult to filter messages from a single BPF program. The preferred method is to store messages in a BPF_PERF_OUTPUT map inside the BPF program, then process them with open_perf_buffer() and kprobe_poll(). An example of this scheme is provided in the open_perf_buffer() documentation.
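A rough sketch of the BPF_PERF_OUTPUT scheme, again using the kmalloc() tracepoint, might look like the following. The struct layout, the Data class, and the decode() and run() helpers are choices made for this example. Newer BCC releases name the polling call perf_buffer_poll(), so the sketch tries that first and falls back to kprobe_poll(); that fallback is an assumption about the installed version.

```python
import ctypes as ct

PROGRAM = r"""
struct data_t {
    u64 call_site;
    u64 bytes_req;
};
BPF_PERF_OUTPUT(events);

TRACEPOINT_PROBE(kmem, kmalloc) {
    struct data_t data = {};
    data.call_site = args->call_site;
    data.bytes_req = args->bytes_req;
    events.perf_submit(args, &data, sizeof(data));
    return 0;
}
"""

class Data(ct.Structure):
    # Must mirror struct data_t in the C program above.
    _fields_ = [("call_site", ct.c_ulonglong),
                ("bytes_req", ct.c_ulonglong)]

def decode(data):
    """Interpret the raw pointer handed to the callback as a Data record."""
    return ct.cast(data, ct.POINTER(Data)).contents

def run():
    from bcc import BPF  # lazy import: needs BCC and root privileges
    b = BPF(text=PROGRAM)

    def on_event(cpu, data, size):
        event = decode(data)
        print("%s %u" % (b.ksym(event.call_site), event.bytes_req))

    b["events"].open_perf_buffer(on_event)
    # Assumption: prefer the newer name, fall back to the older one.
    poll = getattr(b, "perf_buffer_poll", None) or b.kprobe_poll
    try:
        while True:
            poll()
    except KeyboardInterrupt:
        pass
```

Each program gets its own perf buffer, so the messages seen by on_event() come only from this BPF program rather than from every writer on the system.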

Using BPF with applications

This article has focused exclusively on attaching programs to kernel tracepoints, but you can also attach programs to the user-space tracepoints included with many popular applications using User Statically-Defined Tracing (USDT) probes. In the next and final article of this series, I'll cover the origin of USDT probes, the BCC tools that use them, and how you can add them to your own application.

Posted Feb 22, 2018 22:35 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

Are there any smaller (and more feature-limited) BPF languages that don't require full LLVM+Clang to compile?

Posted Feb 23, 2018 2:47 UTC (Fri) by unixbhaskar (guest, #44758)

Yup, that's a good point.

Posted Feb 23, 2018 15:57 UTC (Fri) by danielthompson (subscriber, #97243)

There's ply: https://github.com/iovisor/ply . Is that the sort of thing you have in mind?

Posted Feb 23, 2018 22:17 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)

Yes, something like this. Would be nice to have it in Python completely to avoid having to compile it on Android.

Posted Feb 24, 2018 20:04 UTC (Sat) by justincormack (subscriber, #70439)

There is a LuaJIT backend; see https://github.com/iovisor/bcc/tree/0c8c179fc1283600887ef...

Posted Feb 23, 2018 22:12 UTC (Fri) by blubber (guest, #84003)

In terms of language (although that's probably not your question), there's bpftrace, which has a DTrace-style syntax, though it also uses LLVM in the background (https://github.com/ajor/bpftrace).

Another project that makes it easier to run bcc on remote systems is bpfd, which seems to be used by Android folks. It allows running bcc on a remotely connected system without needing the entire LLVM infrastructure there. The announcement of the project was here: https://lkml.org/lkml/2017/12/29/137 (https://github.com/joelagnel/bpfd). Perhaps this is closer to what you are looking for with respect to remote systems.

Posted Feb 23, 2018 22:18 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)

bpfd is a horrible hack.

It just seems a waste to require the full LLVM machinery when the target is so simple and most scripts are trivial. A good old one-pass translator will probably be more than sufficient for most users.

Posted Feb 23, 2018 2:48 UTC (Fri) by unixbhaskar (guest, #44758)

Nice write-up, Matt! Thanks a bunch... learned a lot.

Posted Feb 23, 2018 4:20 UTC (Fri) by akkornel (subscriber, #75292)

Completely off-topic:

A good April 1st thing would be to post a follow-up, "Some more advanced BCC topics", and have the article be entirely about email.

Posted Feb 24, 2018 1:07 UTC (Sat) by xtifr (guest, #143)

Not to mention the specialized 8086 C compiler you'll get if you type "apt-get install bcc" on most Debian-based systems, which claims to be important for "the development of boot loaders or BIOS-related 8086 code".

A mildly unfortunate overloading of TLAs at best.

Posted Feb 26, 2018 21:34 UTC (Mon) by mathstuf (subscriber, #69389)

The "bcc" tool could also have been the Borland compiler (historically). Woe to those still using it.

Posted Mar 1, 2018 12:26 UTC (Thu) by ianmcc (subscriber, #88379)

A good compiler in its day though. Borland C++ was astounding, compared with the Microsoft C++ at the time - it was a decade or so (and a change of head of their C++ group) before Microsoft decided that any kind of standards compliance was worth aiming for.

Posted Mar 2, 2018 17:39 UTC (Fri) by mathstuf (subscriber, #69389)

I think it was my first C++ compiler[1] (the Dev-C++ IDE), though mostly due to it being free and me being just a grade school student. I don't remember having bad experiences with it specifically, but it has not kept up with the times…

[1]TI-GCC was probably my first C compiler.

Posted Mar 3, 2018 18:51 UTC (Sat) by nix (subscriber, #2304)

If you want a blast from the past, http://tvision.sf.net/ might be worth a look, or would be if *.sf.net wasn't down at the moment. (It's amazing how archaic it seems now, without having changed a bit.)

Posted Mar 5, 2018 15:24 UTC (Mon) by mathstuf (subscriber, #69389)

> without having changed a bit

I see Debian Jessie in the list there, so it's not completely unchanged. ;)

Posted Feb 26, 2018 15:11 UTC (Mon) by ncultra (✭ supporter ✭, #121511)

Timely and relevant article, thank you; I would love to see more. The previous article helped me the same day it was published with a kernel performance evaluation I was doing. I ended up running more than 6k KVM guests on a huge machine and got scheduling latency data using BCC. I spent most of my work time building LLVM for the Linux distro I had to use, which is a variant of Fedora. Once that task was finished, the hardest thing I had to do was to decide which particular tool of many was best to use. If anyone is on the fence because of the LLVM requirement, I would say go ahead, it's worth it.