Comparing BPF performance between implementations
Alan Jowett returned for a second remote presentation at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit to compare the performance of different BPF runtimes. He showed the results of the MIT-licensed BPF microbenchmark suite he has been working on. The benchmark suite does not yet provide a good direct comparison between all platforms, so the results should be taken with a grain of salt. They do seem to indicate that there is some significant variation between implementations, especially for different types of BPF maps.
The benchmarks measure a few different things, including the time taken to actually execute various test programs, but also the overhead of transitioning from the kernel into the BPF VM, and the performance of calling helper functions. It is important to measure these things in a platform-neutral, repeatable way, Jowett said. The benchmark suite uses libbpf to load BPF programs, which uses "compile once — run everywhere" (CO-RE) to run the same ELF artifacts on the different supported platforms.
There are several different kinds of BPF programs included in the benchmark suite, including an empty (no-op) program, programs that exercise various helper functions, and programs that test the performance of BPF maps, including trie and hash-table maps. Measurements are taken on multiple CPU cores in parallel, to make testing the performance of concurrent maps more meaningful.
The eBPF for Windows project uses the benchmark suite as part of its daily continuous-integration (CI) setup to track performance regressions. The CI also runs the same tests on Linux, but Jowett said that those results weren't a good comparison because of infrastructure issues — the GitHub runners the CI uses can't specify a particular Linux kernel version. He also noted that there would be some variation because Windows uses ahead-of-time (AOT) compilation of BPF, rather than just-in-time (JIT) compilation as Linux does.
Despite that, Jowett thought that there were some valuable lessons to be drawn from the benchmarks. He said that AOT compilation outperforms JIT compilation, which itself outperforms interpretation. Alexei Starovoitov challenged that assertion; he said that the JIT being tested on Windows — which was Jowett's basis for comparison, given the Linux infrastructure issues — was fairly dumb, and was not enough to make generalizations about Linux's JIT. Jowett acknowledged that, and pointed out some ways that the Windows JIT could be improved.
Jowett also showed some measurements demonstrating that longest-prefix-match (LPM) trie maps have faster updates than hash tables, and that Windows had trouble matching the performance of Linux's least-recently-used (LRU) tables. He noted that maintaining a global consensus on the age of keys in the table "is expensive".
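The shape of such a map-update measurement is straightforward. A rough user-space sketch (with the map sizes, key layout, and timing loop all assumed here rather than drawn from the suite) might compare update costs for the two map types like this:

    /*
     * Rough sketch: time bulk updates to an LPM-trie map and a hash map
     * from user space.  Sizes and key layout are illustrative; creating
     * the maps requires root or CAP_BPF.
     */
    #include <stdio.h>
    #include <time.h>
    #include <linux/bpf.h>
    #include <bpf/bpf.h>

    #define ENTRIES 100000

    /* LPM-trie keys start with a prefix length, followed by the data to
     * match on (a 4-byte value here, like an IPv4 address). */
    struct lpm_key {
        __u32 prefixlen;
        __u32 data;
    };

    static long long elapsed_ns(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
    }

    int main(void)
    {
        LIBBPF_OPTS(bpf_map_create_opts, opts,
                    .map_flags = BPF_F_NO_PREALLOC);    /* required for LPM tries */
        int lpm = bpf_map_create(BPF_MAP_TYPE_LPM_TRIE, NULL,
                                 sizeof(struct lpm_key), sizeof(__u64),
                                 ENTRIES, &opts);
        int hash = bpf_map_create(BPF_MAP_TYPE_HASH, NULL,
                                  sizeof(__u32), sizeof(__u64),
                                  ENTRIES, NULL);
        struct timespec t0, t1;
        __u64 value = 0;

        if (lpm < 0 || hash < 0)
            return 1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (__u32 i = 0; i < ENTRIES; i++) {
            struct lpm_key key = { .prefixlen = 32, .data = i };
            bpf_map_update_elem(lpm, &key, &value, BPF_ANY);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("lpm trie: %lld ns\n", elapsed_ns(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (__u32 i = 0; i < ENTRIES; i++)
            bpf_map_update_elem(hash, &i, &value, BPF_ANY);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("hash map: %lld ns\n", elapsed_ns(t0, t1));

        return 0;
    }

An LRU hash map (BPF_MAP_TYPE_LRU_HASH) could be timed the same way by switching the map type; the extra bookkeeping those maps need in order to decide which entries to evict is where the cost Jowett described comes from.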
Given the difficulties Jowett had measuring Linux performance, however, it is hard to say how eBPF for Windows and the Linux BPF implementation actually compare. Perhaps once those are resolved, this work will prove a useful tool for highlighting potential performance improvements.