Data-type profiling for perf
Tooling for profiling the effects of memory usage and layout has always lagged behind that for profiling processor activity, so Namhyung Kim's patch set for data-type profiling in perf is a welcome addition. It provides aggregated breakdowns of memory accesses by data type that can inform changes to structure layout and access patterns. Existing tools have either focused on profiling allocations, as heaptrack does, or, like perf mem, accounted for memory accesses only at the address level. This new work builds on the latter, using DWARF debugging information to correlate memory operations with their source-level types.
Recent kernel history is full of commits that reorder structures, pad fields, or pack them to improve performance. But how does one find structures in need of optimization, and characterize access to them, in order to make such decisions? Pahole gives a static view of how data structures span cache lines and where padding exists, but can reveal nothing about access patterns. perf c2c is a powerful tool for identifying cache-line contention, but it is of little use for single-threaded access. Understanding the access behavior of a running program requires a broader picture of accesses to its data structures; this is where Kim's data-type profiling work comes in.
Take, for example, this recent change to perf from Ian Rogers, who described it tersely as: "Avoid 6 byte hole for padding. Place more frequently used fields first in an attempt to use just 1 cache line in the common case." This is a classic structure-reordering optimization. Rogers quotes pahole's output for the structure in question before the optimization:
    struct callchain_list {
            u64                        ip;                   /*     0     8 */
            struct map_symbol          ms;                   /*     8    24 */
            struct {
                    _Bool              unfolded;             /*    32     1 */
                    _Bool              has_children;         /*    33     1 */
            };                                               /*    32     2 */

            /* XXX 6 bytes hole, try to pack */

            u64                        branch_count;         /*    40     8 */
            u64                        from_count;           /*    48     8 */
            u64                        predicted_count;      /*    56     8 */
            /* --- cacheline 1 boundary (64 bytes) --- */
            u64                        abort_count;          /*    64     8 */
            u64                        cycles_count;         /*    72     8 */
            u64                        iter_count;           /*    80     8 */
            u64                        iter_cycles;          /*    88     8 */
            struct branch_type_stat *  brtype_stat;          /*    96     8 */
            const char *               srcline;              /*   104     8 */
            struct list_head           list;                 /*   112    16 */

            /* size: 128, cachelines: 2, members: 13 */
            /* sum members: 122, holes: 1, sum holes: 6 */
    };
We can see that there is a hole, and that the whole structure spans two cache lines, but not much more than that. Rogers's patch moves the list_head structure up to fill the reported hole and, at the same time, puts a heavily accessed structure into the same cache line as the other frequently used data. Making a change like that, though, requires knowing which fields are accessed most often; that is exactly what perf's new data-type profiling reveals.
To use it, one starts by sampling memory operations with:
perf mem record
Intel, AMD, and Arm all provide some support for recording precise memory events on their contemporary processors, though how comprehensive that support is varies. On processors that can profile loads and stores separately (such as those with Arm SPE or Intel PEBS), a command like:
perf mem record -t store
can be used to find fields that are heavily written. Here, we'll use it on perf report itself with a reasonably sized call chain to evaluate the change.
Once a run has been done with the above command, it is time to use the resulting data to do the data-type profile. Kim's changes add a new command:
perf annotate --data-type
that prints structures with samples per field; it can be narrowed to a single type by providing an argument. This is what the output from:
perf annotate --data-type=callchain_list
looks like before Rogers's patch (with the most active fields highlighted in bold):
    Annotate type: 'struct callchain_list' in [...]/tools/perf/perf (218 samples):
    ============================================================================
        samples  offset  size  field
            218       0   128  struct callchain_list {
             18       0     8      u64 ip;
            157       8    24      struct map_symbol ms {
              0       8     8          struct maps* maps;
             60      16     8          struct map* map;
             97      24     8          struct symbol* sym;
                                   };
              0      32     2      struct {
              0      32     1          _Bool unfolded;
              0      33     1          _Bool has_children;
                                   };
              0      40     8      u64 branch_count;
              0      48     8      u64 from_count;
              0      56     8      u64 predicted_count;
              0      64     8      u64 abort_count;
              0      72     8      u64 cycles_count;
              0      80     8      u64 iter_count;
              0      88     8      u64 iter_cycles;
              0      96     8      struct branch_type_stat* brtype_stat;
              0     104     8      char* srcline;
             43     112    16      struct list_head list {
             43     112     8          struct list_head* next;
              0     120     8          struct list_head* prev;
                                   };
                               };
This makes the point of the patch clear. We can see that list is the only field on the second cache line that is accessed as part of this workload. If that field could be moved to the first cache line, the cache behavior of the application should improve. Data-type profiling lets us verify that assumption; its output after the patch looks like:
    Annotate type: 'struct callchain_list' in [...]/tools/perf/perf (154 samples):
    ============================================================================
        samples  offset  size  field
            154       0   128  struct callchain_list {
             28       0    16      struct list_head list {
             28       0     8          struct list_head* next;
              0       8     8          struct list_head* prev;
                                   };
              9      16     8      u64 ip;
            116      24    24      struct map_symbol ms {
              1      24     8          struct maps* maps;
             60      32     8          struct map* map;
             55      40     8          struct symbol* sym;
                                   };
              1      48     8      char* srcline;
              0      56     8      u64 branch_count;
              0      64     8      u64 from_count;
              0      72     8      u64 cycles_count;
              0      80     8      u64 iter_count;
              0      88     8      u64 iter_cycles;
              0      96     8      struct branch_type_stat* brtype_stat;
              0     104     8      u64 predicted_count;
              0     112     8      u64 abort_count;
              0     120     2      struct {
              0     120     1          _Bool unfolded;
              0     121     1          _Bool has_children;
                                   };
                               };
For this workload, at least, the access patterns are as advertised. Some quick perf stat benchmarking revealed that the instructions-per-cycle count had increased and the time elapsed had decreased as a consequence of the change.
Anyone who has spent a lot of time scrutinizing pahole output, trying to shuffle structure members to balance size, cache-line access, false sharing, and so on, is likely to find this useful. (Readers who have not yet delved into this rabbit hole might want to start with Ulrich Drepper's series on LWN, "What every programmer should know about memory", specifically part 5, "What programmers can do".)
Data-type profiling obviously needs information about the program it is looking at to be able to do its job; specifically, identifying the data type associated with a load or store requires DWARF debugging information for locations, variables, and types. Any language supported by perf should work. The author verified that, aside from C, Rust and Go programs produce reasonable output, though it is not always idiomatic for the language involved.
After sampling memory accesses, data-type aggregation correlates sampled instruction arguments with locations in the associated DWARF information, and then with their type. As is often the case in profiling, compiler optimizations can impede this search. This unfortunately means that there are cases where perf won't associate a memory event with a type because the DWARF information either wasn't thorough enough, or was too complex for perf to interpret.
Kim spoke about this work at the 2023 Linux Plumbers Conference (video), and noted situations involving chains of pointers as a common case that isn't supported well currently. While he has a workaround for this problem, he also pointed out that there is a proposal for inverted location lists in DWARF that would be a more general solution.
For any given program address (usually the current program counter (PC)), location lists in DWARF [large PDF] allow a debugging tool to look up how a symbol is currently stored; the result is a location description, which may indicate that the symbol currently lives in a register or at an address. What tools like perf would rather have is the reverse: a mapping from an address or register to a symbol. This is effectively an inversion of location lists, and computing that inversion is much less expensive for the compiler, which has the information in hand when emitting the debugging data, than it is for a tool like perf to reconstruct after the fact. This has been a sore spot for perf in the past, judging from the discussion between Arnaldo Carvalho de Melo and Peter Zijlstra during the former's Linux Plumbers Conference 2022 talk (video) on profiling data structures.
As of this article, Kim's work is unmerged but, since the changes are only in user space, it's possible to try them out easily by building perf from Kim's perf/data-profile-v3 branch. Given the enthusiastic reactions to the v1 patch set from perf tools maintainer Arnaldo Carvalho de Melo, Peter Zijlstra, and Ingo Molnar, it seems likely that it won't remain unmerged for long.
Index entries for this article
GuestArticles: Squires, Julian