
Data-type profiling for perf

December 21, 2023

This article was contributed by Julian Squires

Tooling for profiling the effects of memory usage and layout has always lagged behind that for profiling processor activity, so Namhyung Kim's patch set for data-type profiling in perf is a welcome addition. It provides aggregated breakdowns of memory accesses by data type that can inform structure layout and access pattern changes. Existing tools have either, like heaptrack, focused on profiling allocations, or, like perf mem, on accounting memory accesses only at the address level. This new work builds on the latter, using DWARF debugging information to correlate memory operations with their source-level types.

Recent kernel history is full of examples of commits that reorder structures, pad fields, or pack them to improve performance. But how does one discover structures in need of optimization and characterize access to them to make such decisions? Pahole gives a static view of how data structures span cache lines and where padding exists, but can't reveal anything about access patterns. perf c2c is a powerful tool for identifying cache-line contention, but won't reveal anything useful for single-threaded access. To understand the access behavior of a running program, a broader picture of accesses to data structures is needed. This is where Kim's data-type profiling work comes in.
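
For reference, a static layout report like the ones quoted below can be produced for any binary built with debugging information; the structure and binary names here are only placeholders:

    pahole -C my_struct ./myprog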

Take, for example, this recent change to perf from Ian Rogers, who described it tersely as: "Avoid 6 byte hole for padding. Place more frequently used fields first in an attempt to use just 1 cache line in the common case." This is a classic structure-reordering optimization. Rogers quotes pahole's output for the structure in question before the optimization:

    struct callchain_list {
        u64                        ip;                   /*     0     8 */
        struct map_symbol          ms;                   /*     8    24 */
        struct {
                _Bool              unfolded;             /*    32     1 */
                _Bool              has_children;         /*    33     1 */
        };                                               /*    32     2 */

        /* XXX 6 bytes hole, try to pack */

        u64                        branch_count;         /*    40     8 */
        u64                        from_count;           /*    48     8 */
        u64                        predicted_count;      /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        u64                        abort_count;          /*    64     8 */
        u64                        cycles_count;         /*    72     8 */
        u64                        iter_count;           /*    80     8 */
        u64                        iter_cycles;          /*    88     8 */
        struct branch_type_stat *  brtype_stat;          /*    96     8 */
        const char  *              srcline;              /*   104     8 */
        struct list_head           list;                 /*   112    16 */

        /* size: 128, cachelines: 2, members: 13 */
        /* sum members: 122, holes: 1, sum holes: 6 */
    };

We can see that there is a hole, and that the whole structure spans two cache lines, but not much more than that. Rogers's patch moves the list_head structure up to fill the reported hole and, at the same time, puts a heavily accessed structure into the same cache line as the other frequently used data. Making a change like that, though, requires knowledge of which fields are most often accessed. This is where perf's new data-type profiling comes in.
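
Reconstructed from the post-patch offsets shown later in this article, the reordered layout looks roughly like the following sketch (it may not match Rogers's patch line-for-line):

    struct callchain_list {
        struct list_head           list;       /* hot, now in cache line 0 */
        u64                        ip;
        struct map_symbol          ms;
        const char  *              srcline;
        u64                        branch_count;
        u64                        from_count;
        u64                        cycles_count;
        u64                        iter_count;
        u64                        iter_cycles;
        struct branch_type_stat *  brtype_stat;
        u64                        predicted_count;
        u64                        abort_count;
        struct {
                _Bool              unfolded;
                _Bool              has_children;
        };
    };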

To use the new data-type profiling support, one starts by sampling memory operations with:

    perf mem record

Intel, AMD, and Arm each have some support for recording precise memory events on their contemporary processors, but this support varies in how comprehensive it is. On processors that support separating load and store profiling (such as Arm SPE or Intel PEBS), a command like:

    perf mem record -t store

can be used to find fields that are heavily written. Here, we'll use it on perf report itself with a reasonably sized call chain to evaluate the change.
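
Concretely, a recording run might look something like this, where the profiled command and its input file are just placeholders for "perf report with a reasonably sized call chain":

    perf mem record -- ./perf report -i big-profile.data --stdio > /dev/null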

Once a run has been done with the above command, it is time to use the resulting data to do the data-type profile. Kim's changes add a new command:

    perf annotate --data-type

that prints structures with samples per field; it can be narrowed to a single type by providing an argument. This is what the output from:

    perf annotate --data-type=callchain_list

looks like before Rogers's patch (with the most active fields highlighted in bold):

    Annotate type: 'struct callchain_list' in [...]/tools/perf/perf (218 samples):
    ============================================================================
    samples     offset       size  field
        218          0        128  struct callchain_list         {
         18          0          8      u64      ip;
        157          8         24      struct map_symbol        ms {
          0          8          8          struct maps* maps;
         60         16          8          struct map*  map;
         97         24          8          struct symbol*       sym;
                                       };
          0         32          2      struct    {
          0         32          1          _Bool        unfolded;
          0         33          1          _Bool        has_children;
                                       };
          0         40          8      u64      branch_count;
          0         48          8      u64      from_count;
          0         56          8      u64      predicted_count;
          0         64          8      u64      abort_count;
          0         72          8      u64      cycles_count;
          0         80          8      u64      iter_count;
          0         88          8      u64      iter_cycles;
          0         96          8      struct branch_type_stat* brtype_stat;
          0        104          8      char*    srcline;
         43        112         16      struct list_head list {
         43        112          8          struct list_head*    next;
          0        120          8          struct list_head*    prev;
                                       };
                                   };

This makes the point of the patch clear. We can see that list is the only field on the second cache line that is accessed as part of this workload. If that field could be moved to the first cache line, the cache behavior of the application should improve. Data-type profiling lets us verify that assumption; its output after the patch looks like:

    Annotate type: 'struct callchain_list' in [...]/tools/perf/perf (154 samples):
    ============================================================================
    samples     offset       size  field
        154          0        128  struct callchain_list         {
         28          0         16      struct list_head list {
         28          0          8          struct list_head*    next;
          0          8          8          struct list_head*    prev;
                                       };
          9         16          8      u64      ip;
        116         24         24      struct map_symbol        ms {
          1         24          8          struct maps* maps;
         60         32          8          struct map*  map;
         55         40          8          struct symbol*       sym;
                                       };
          1         48          8      char*    srcline;
          0         56          8      u64      branch_count;
          0         64          8      u64      from_count;
          0         72          8      u64      cycles_count;
          0         80          8      u64      iter_count;
          0         88          8      u64      iter_cycles;
          0         96          8      struct branch_type_stat* brtype_stat;
          0        104          8      u64      predicted_count;
          0        112          8      u64      abort_count;
          0        120          2      struct    {
          0        120          1          _Bool        unfolded;
          0        121          1          _Bool        has_children;
                                       };
                                   };

For this workload, at least, the access patterns are as advertised. Some quick perf stat benchmarking revealed that the instructions-per-cycle count had increased and the time elapsed had decreased as a consequence of the change.
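
That check can be repeated before and after applying the patch with something like the following (again with placeholder arguments); the default perf stat output includes both the instructions-per-cycle figure and the elapsed time:

    perf stat -- ./perf report -i big-profile.data --stdio > /dev/null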

Anyone who has spent a lot of time scrutinizing pahole output, trying to shuffle structure members to balance size, cache-line access, false sharing, and so on, is likely to find this useful. (Readers who have not yet delved into this rabbit hole might want to start with Ulrich Drepper's series on LWN, "What every programmer should know about memory", specifically part 5, "What programmers can do".)

Data-type profiling obviously needs information about the program it is looking at to be able to do its job; specifically, identifying the data type associated with a load or store requires DWARF debugging information describing locations, variables, and types. Any language supported by perf should work. The author verified that, aside from C, Rust and Go programs produce reasonable output, though it is not always idiomatic for the language involved.
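
For C programs, that usually just means building with -g; a minimal, hypothetical example of the whole workflow would be:

    gcc -O2 -g -o myprog myprog.c
    perf mem record -- ./myprog
    perf annotate --data-type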

After sampling memory accesses, data-type aggregation correlates sampled instruction arguments with locations in the associated DWARF information, and then with their type. As is often the case in profiling, compiler optimizations can impede this search. This unfortunately means that there are cases where perf won't associate a memory event with a type because the DWARF information either wasn't thorough enough, or was too complex for perf to interpret.

Kim spoke about this work at the 2023 Linux Plumbers Conference (video), and noted situations involving chains of pointers as a common case that isn't supported well currently. While he has a workaround for this problem, he also pointed out that there is a proposal for inverted location lists in DWARF that would be a more general solution.
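
As a contrived illustration of the pointer-chain problem (not an example from the patch set): the sampled load below goes through an intermediate pointer that typically exists only in a temporary register, with no named variable for the DWARF location information to describe.

    struct counter { long count; };
    struct holder  { struct counter *counter; };

    long get_count(struct holder *h)
    {
        /*
         * Two dependent loads: first h->counter, then ->count.  The
         * base register of the first load (holding "h") is described
         * by DWARF; the base register of the second (holding the
         * anonymous h->counter value) often is not, so that load is
         * hard to attribute to a type.
         */
        return h->counter->count;
    }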

For any given program address (usually the current program counter (PC)), location lists in DWARF [large PDF] allow a debugging tool to look up how a symbol is currently stored; the answer is a location description, which may indicate that the symbol currently lives in a register or at some address. What tools like perf would rather have is the reverse mapping, from an address or register to a symbol. This is effectively an inversion of location lists, and computing that inversion is much less expensive for the compiler, which is emitting the debugging information in the first place, than for a tool trying to reconstruct it afterward. This has been a sore spot for perf in the past, judging from the discussion between Arnaldo Carvalho de Melo and Peter Zijlstra during the former's Linux Plumbers Conference 2022 talk (video) on profiling data structures.
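
As a conceptual sketch (not actual DWARF output), consider a sampled load in a small function like this one:

    struct list_head { struct list_head *next, *prev; };

    struct list_head *advance(struct list_head *node)
    {
        /*
         * DWARF location lists answer the debugger's question: "where
         * is 'node' at this PC?" (for example, in register RDI on
         * x86-64).  Data-type profiling needs the inverse: "the base
         * register of this load is RDI at this PC -- which variable,
         * and therefore which type, does it hold?"
         */
        return node->next;
    }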

As of this writing, Kim's work is unmerged but, since the changes are only in user space, it's possible to try them out easily by building perf from Kim's perf/data-profile-v3 branch. Given the enthusiastic reactions to the v1 patch set from perf tools maintainer Arnaldo Carvalho de Melo, Peter Zijlstra, and Ingo Molnar, it seems likely that it won't remain unmerged for long.

Index entries for this article
GuestArticles: Squires, Julian



Data-type profiling for perf

Posted Dec 21, 2023 16:51 UTC (Thu) by pctammela (guest, #126687) [Link]

This is really nice, great work!

Data-type profiling for perf

Posted Dec 22, 2023 8:11 UTC (Fri) by Sesse (subscriber, #53779) [Link]

Oh wow, I've been wanting this for some time. At some point, I even wrote a hack for something similar myself :-)

Data-type profiling for perf

Posted Dec 22, 2023 12:12 UTC (Fri) by acme (subscriber, #2443) [Link] (4 responses)

It's in https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf..., probably will move to perf-tools-next later today, on its way to Linux v6.8 in January.

Great article! Should encourage people to test it and help with finding issues, fixing problems and adding more features.

It's not in 6.8

Posted Mar 23, 2024 9:36 UTC (Sat) by Hi-Angel (guest, #110915) [Link] (3 responses)

> on its way to Linux v6.8 in January

I'm on 6.8.1 and calling a `perf annotate --data-type` results in a `Error: unknown option `data-type'`. So it didn't make it.

It's not in 6.8

Posted Mar 23, 2024 9:40 UTC (Sat) by Hi-Angel (guest, #110915) [Link] (2 responses)

Oh, actually, sorry for the confusion, it seems that for some reason the Arch Linux packages updated the kernel but still didn't update the `perf` utility that goes with it, Idk why. It was marked outdated 12 days ago. Anyway, please disregard my comment, will re-test it once Arch updates perf to the kernel version.

It's not in 6.8

Posted Apr 5, 2024 11:50 UTC (Fri) by Hi-Angel (guest, #110915) [Link] (1 responses)

FTR, apparently there's some problem with `perf` in 6.8. I was tired of waiting for 6.8 perf to appear in Arch repos so decided to download the PKGBUILD and compile it myself. Well, now I know why it takes so long for perf to get updated: it's unbuildable, the linking stage fails with (also, Idk why the text below is not getting aligned, I inserted indentation before it 🤷‍♂️):

    LINK perf
    /usr/bin/ld: /usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../lib/Scrt1.o: in function `_start':
    (.text+0x1b): undefined reference to `main'
    /usr/bin/ld: pmu-events/pmu-events-in.o: in function `map_for_pmu':
    pmu-events.c:(.text+0x174): undefined reference to `perf_pmu__getcpuid'
    /usr/bin/ld: pmu-events.c:(.text+0x1ac): undefined reference to `strcmp_cpuid_str'
    /usr/bin/ld: pmu-events/pmu-events-in.o: in function `pmu_events_table__for_each_event':
    (.text+0x311): undefined reference to `pmu__name_match'
    /usr/bin/ld: pmu-events/pmu-events-in.o: in function `pmu_events_table__find_event':
    (.text+0x4ce): undefined reference to `pmu__name_match'
    /usr/bin/ld: (.text+0x5e0): undefined reference to `pmu__name_match'
    /usr/bin/ld: pmu-events/pmu-events-in.o: in function `pmu_events_table__num_events':
    (.text+0x6f7): undefined reference to `pmu__name_match'
    /usr/bin/ld: pmu-events/pmu-events-in.o: in function `perf_pmu__find_events_table':
    (.text+0xa33): undefined reference to `pmu__name_match'
    /usr/bin/ld: pmu-events/pmu-events-in.o:(.text+0xae3): more undefined references to `pmu__name_match' follow
    /usr/bin/ld: pmu-events/pmu-events-in.o: in function `find_core_events_table':
    (.text+0xb79): undefined reference to `strcmp_cpuid_str'
    /usr/bin/ld: pmu-events/pmu-events-in.o: in function `find_core_metrics_table':
    (.text+0xc09): undefined reference to `strcmp_cpuid_str'
    /usr/bin/ld: /home/constantine/Projects/builds/linux-tools/src/linux/tools/perf/libsymbol/libsymbol.a(libsymbol-in.o): in function `__tolower':
    /home/constantine/Projects/builds/linux-tools/src/linux/tools/lib/symbol/kallsyms.c:52:(.text+0x10): undefined reference to `_ctype'
    /usr/bin/ld: /home/constantine/Projects/builds/linux-tools/src/linux/tools/perf/libsymbol/libsymbol.a(libsymbol-in.o): in function `__toupper':
    /home/constantine/Projects/builds/linux-tools/src/linux/tools/lib/symbol/kallsyms.c:59:(.text+0x3e): undefined reference to `_ctype'
    collect2: error: ld returned 1 exit status

It's not in 6.8

Posted Apr 5, 2024 12:01 UTC (Fri) by Hi-Angel (guest, #110915) [Link]

Okay, I figured out what it's caused by: it's because I have `-flto` in default options and evidently there's some bug in `perf` that makes it break when that's defined. After removing flto I managed to compile it.

Data-type profiling for perf

Posted Dec 23, 2023 5:15 UTC (Sat) by roc (subscriber, #30627) [Link] (3 responses)

We don't really want compilers to emit redundant DWARF tables. That slows down builds and creates bloated binaries. A better approach would be to have a tool that can build inverted location lists from the regular location lists, persistently caching the results by build-ID when that's helpful.

Data-type profiling for perf

Posted Dec 26, 2023 21:08 UTC (Tue) by DanilaBerezin (guest, #168271) [Link] (2 responses)

I think slower builds and bloated binaries are an okay trade off for a debug build. But in general, yeah I would agree, I think if it's possible to create a secondary program that inverts the lists after the build, that would probably be preferable.

Data-type profiling for perf

Posted Dec 27, 2023 10:07 UTC (Wed) by Wol (subscriber, #4433) [Link]

Sounds to me like a straightforward database file with index ...

Cheers,
Wol

Data-type profiling for perf

Posted Dec 27, 2023 10:14 UTC (Wed) by taladar (subscriber, #68407) [Link]

If it is cheaper for the compiler to compute, writing the information to a separate file as part of the compile process might also be an option.

Data-type profiling for perf

Posted Dec 24, 2023 0:16 UTC (Sun) by dankamongmen (subscriber, #35141) [Link]

this looks absolutely outstanding

Data-type profiling for perf

Posted Dec 26, 2023 21:27 UTC (Tue) by rywang014 (subscriber, #167182) [Link]

Could this be automated at a larger scale to discover more layout optimizations? One could run a wide range of benchmarks with this tool and look for structs with multiple active cache lines whose hot fields could be shuffled into the same cache line.

Data-type profiling for perf

Posted Jan 4, 2024 10:47 UTC (Thu) by rwmj (subscriber, #5474) [Link]

This looks fantastic. Next step would be some kind of latency analysis. I wonder if it's possible to see which fields have high latency for writes (which might indicate a cache line "ping-ponging" between cores)?


Copyright © 2023, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds