
A medley of performance-related BPF patches

By Jonathan Corbet
January 2, 2020
One of the advantages of the in-kernel BPF virtual machine is that it is fast. BPF programs are just-in-time compiled and run directly by the CPU, so there is no interpreter overhead. For many of the intended use cases, though, "fast" can never be quite fast enough. It is thus unsurprising that there are currently a number of patch sets under development that are intended to speed up one aspect or another of using BPF in the system. A few, in particular, seem about ready to hit the mainline.

The BPF dispatcher

BPF programs cannot run until they are "attached" to a specific call point. Tracing programs are attached to tracepoints, while networking express data path (XDP) programs are attached to a specific network device. In general, more than one program can be attached at any given location. When it comes time to run attached programs, the kernel will work through a linked list and invoke each program in turn.
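
As a rough illustration (this is not code from any of the patch sets discussed here; the object-file name, section title, and network interface are invented for the example), attaching an XDP program to a device with libbpf looks something like this:

    #include <bpf/libbpf.h>
    #include <net/if.h>

    /* Load "xdp_prog.o" and attach its "xdp" program to eth0; both
     * names are assumptions for this sketch. */
    static int attach_xdp(void)
    {
        struct bpf_object *obj;
        struct bpf_program *prog;
        int prog_fd, ifindex;

        obj = bpf_object__open_file("xdp_prog.o", NULL);
        if (libbpf_get_error(obj))
            return -1;
        if (bpf_object__load(obj))
            return -1;

        prog = bpf_object__find_program_by_title(obj, "xdp");
        if (!prog)
            return -1;
        prog_fd = bpf_program__fd(prog);

        ifindex = if_nametoindex("eth0");
        /* Every packet arriving on eth0 will now be handed to the
         * program before it enters the normal network stack. */
        return bpf_set_link_xdp_fd(ifindex, prog_fd, 0);
    }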

Actually executing a compiled BPF program is done with an indirect jump. Such jumps were never entirely fast, but in the age of speculative-execution vulnerabilities those jumps have been turned into retpolines — a construct that defeats a number of Spectre attacks, but which also turns indirect jumps into something that is far slower than they were before. For cases where BPF programs are invoked frequently, such as for every incoming network packet, that extra overhead hurts.

There have been a number of efforts aimed at reducing the retpoline performance penalty in various parts of the kernel. The BPF dispatcher patch set is Björn Töpel's approach to the problem for BPF programs, and for the XDP use case in particular. It maintains a machine-code trampoline containing a direct jump instruction for every attached BPF program; this trampoline must be regenerated whenever a program is added to or removed from the list. When the time comes to call a BPF program, the trampoline is invoked with the address of the program of interest; it then executes a binary search to find the direct-jump instruction corresponding to that program. The jump is then executed, causing the desired program to be run.
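
The trampoline itself is generated x86 code, but the control flow it implements can be sketched in C; the program names below are invented, and the hand-written comparison tree stands in for the generated binary search:

    #include <stdint.h>

    typedef unsigned int (*bpf_func_t)(const void *ctx);

    /* Hypothetical attached programs, known to the dispatcher and
     * sorted by address. */
    extern unsigned int prog_a(const void *ctx);
    extern unsigned int prog_b(const void *ctx);
    extern unsigned int prog_c(const void *ctx);

    static unsigned int dispatch(bpf_func_t target, const void *ctx)
    {
        uintptr_t t = (uintptr_t)target;

        /* Binary search over the known program addresses; every call
         * taken below is a direct call, so no retpoline is needed. */
        if (t <= (uintptr_t)prog_b) {
            if (t == (uintptr_t)prog_a)
                return prog_a(ctx);
            if (t == (uintptr_t)prog_b)
                return prog_b(ctx);
        } else if (t == (uintptr_t)prog_c) {
            return prog_c(ctx);
        }
        /* Not an attached program: fall back to an ordinary (and, with
         * retpolines, slow) indirect call. */
        return target(ctx);
    }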

That may seem like a lot of overhead to replace an indirect call, but it is still faster than using a retpoline — by a factor of about three, according to the performance result posted with the patch series. In fact, indirect jumps are so expensive that the dispatcher is competitive even in the absence of retpolines, so it is enabled whether retpolines are in use or not. This code is in its fifth revision and seems likely to make its way into the mainline before too long.

Memory-mappable maps

BPF maps are the way that BPF programs store persistent data; they come in a number of varieties but are essentially associative arrays that can be shared with other BPF programs or with user space. Access to maps from within BPF programs is done by way of special helper functions; since everything happens within the kernel, this access is relatively fast. Getting at a BPF map from user space, instead, must be done with the bpf() system call, which provides operations like BPF_MAP_LOOKUP_ELEM and BPF_MAP_UPDATE_ELEM.
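
A single-element read from user space thus requires a trip into the kernel every time. A minimal sketch using the raw system call (there is no glibc wrapper for bpf(), and the key and value sizes here are assumptions for the example) looks like:

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/bpf.h>

    /* Read one element from a map with 4-byte keys and 8-byte values
     * (assumptions for this example). */
    static long lookup_elem(int map_fd, uint32_t key, uint64_t *value)
    {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.map_fd = map_fd;
        attr.key = (uint64_t)(unsigned long)&key;
        attr.value = (uint64_t)(unsigned long)value;

        /* One system call per element read -- the overhead that the
         * patches described below try to reduce. */
        return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
    }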

If one simply needs to read out the results at the end of a tracing run, calling bpf() is unlikely to be a problem. In the case of user-space programs that run for a long time and access a lot of data in BPF maps, though, the system-call overhead may well prove to be too much. Much of the time, the key to good performance is avoiding system calls as much as possible; making a call into the system for each item of data exchanged with a BPF program runs counter to that principle. Andrii Nakryiko has a partial solution to this problem in the form of memory-mappable BPF maps. It allows a user-space process to map a BPF array map (one that is indexed with simple integers) directly into its address space; thereafter, data in BPF maps can be accessed directly, with no need for system calls at all.

There are some limitations in the current patch set; only array maps can be mapped in this way, and maps containing spinlocks cannot be mapped (which makes sense, since user space will be unable to participate in the locking protocol anyway). Maps must be created with the BPF_F_MMAPABLE flag (which causes them to be laid out differently in memory) to be mappable. This patch set has been applied to the BPF repository and can be expected to show up in the 5.6 kernel.
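
In rough terms (the map parameters are assumptions and error handling is minimal), using such a map from user space comes down to a single mmap() call:

    #include <stdint.h>
    #include <sys/mman.h>

    /* map_fd refers to a BPF array map created with BPF_F_MMAPABLE and
     * holding nr_entries 8-byte values (assumptions for this sketch). */
    static uint64_t *map_array(int map_fd, unsigned int nr_entries)
    {
        void *p = mmap(NULL, nr_entries * sizeof(uint64_t),
                       PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);

        if (p == MAP_FAILED)
            return NULL;
        /* From here on, p[i] reads or updates element i directly, with
         * no bpf() calls at all. */
        return p;
    }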

Batched map operations

Memory-mapping BPF maps is one way of avoiding the bpf() system call but, as seen above, it has some limitations. A different approach to reducing system calls can be seen in the batched operations patch set from Brian Vazquez. System calls are still required to access BPF map elements, but it becomes possible to access multiple elements with a single system call.

In particular, the patch set introduces four new map-related commands for the bpf() system call: BPF_MAP_LOOKUP_BATCH, BPF_MAP_LOOKUP_AND_DELETE_BATCH, BPF_MAP_UPDATE_BATCH, and BPF_MAP_DELETE_BATCH. These commands require the following structure to be passed in the bpf() call:

    struct { /* struct used by BPF_MAP_*_BATCH commands */
        __aligned_u64   in_batch;
        __aligned_u64   out_batch;
        __aligned_u64   keys;
        __aligned_u64   values;
        __u32           count;
        __u32           map_fd;
        __u64           elem_flags;
        __u64           flags;
    } batch;

For lookup operations (which, despite their name, are intended to read through a map's entries rather than look up specific entries), keys points to an array able to hold count keys; values is an array for count values. The kernel will pass through the map, storing the keys and associated values for a maximum of that many elements, and setting count to the number actually returned. Setting in_batch to NULL starts the lookup at the beginning of the map; the out_batch value can be used for subsequent calls to pick up where the previous call left off, thus allowing traversal of the entire map.
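
To make the traversal concrete, here is a sketch of reading through a map 100 elements at a time; the key and value sizes are assumptions, kernel headers defining the new commands are assumed, and the in_batch/out_batch handling follows the description above rather than any final API:

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/bpf.h>

    #define BATCH 100

    /* Walk a map with 4-byte keys and 8-byte values (assumptions),
     * BATCH elements per bpf() call. */
    static long lookup_all(int map_fd)
    {
        uint32_t keys[BATCH];
        uint64_t values[BATCH];
        uint64_t in = 0, out = 0;
        union bpf_attr attr;
        int first = 1;
        long ret;

        do {
            memset(&attr, 0, sizeof(attr));
            attr.batch.map_fd = map_fd;
            attr.batch.keys = (uint64_t)(unsigned long)keys;
            attr.batch.values = (uint64_t)(unsigned long)values;
            attr.batch.count = BATCH;
            /* A null in_batch starts at the beginning of the map;
             * later calls resume where the previous one stopped. */
            attr.batch.in_batch = first ? 0 : (uint64_t)(unsigned long)&in;
            attr.batch.out_batch = (uint64_t)(unsigned long)&out;

            ret = syscall(__NR_bpf, BPF_MAP_LOOKUP_BATCH,
                          &attr, sizeof(attr));
            /* attr.batch.count now holds the number of elements
             * actually returned; process keys[] and values[] here. */
            in = out;
            first = 0;
        } while (ret == 0 && attr.batch.count == BATCH);

        return ret;
    }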

Update and delete operations expect keys to contain the keys for the map elements to be affected. Updates also use values for the new values to be associated with keys.

The batch operations do not eliminate system calls for access to map elements, but they can reduce those calls considerably; one call can affect 100 (or more) elements at a time rather than just one element. The batch operations do have some significant advantages over memory-mapping; for example, they can be used for any map type, not just array maps. It is also possible to perform operations (like deletion) that cannot be done with memory-mapping.

There is thus a place for both approaches. This patch set is in its third revision, having picked up a number of reviews and acks along the way, so it, too, seems likely to be merged in the near future.


A medley of performance-related BPF patches

Posted Jan 3, 2020 9:24 UTC (Fri) by scientes (guest, #83068) [Link] (1 responses)

Why don't we rename Linux to BPF-kernel?

A medley of performance-related BPF patches

Posted Jan 3, 2020 11:07 UTC (Fri) by darwi (subscriber, #131202) [Link]

> Why don't we rename Linux to BPF-kernel?

A heavy reliance on BPF is not a bad thing in and of itself.

It reminds me of the GCC scenario: GCC made pluggability quite hard, on purpose, out of a political fear of plugins and of GNU losing control. LLVM, meanwhile, heavily encouraged extensibility: static analysis, R&D in compilers and programming languages, etc.

Within about ten years, a lot of new programming languages and static-analysis tools were using LLVM *exclusively*, and GCC had taken a back seat.

TLDR: no need to hate BPF here. It's encouraging a lot of R&D in the kernel and surrounding plumbing layer. Let the technology progress freely...

A medley of performance-related BPF patches

Posted Jan 15, 2020 14:03 UTC (Wed) by anton (subscriber, #25547) [Link] (3 responses)

Actually, correctly predicted indirect branches are relatively cheap on modern CPUs. Unfortunately, I cannot compare them to direct branches easily (I have seen a case where replacing an indirect call by a direct call had no performance effect, though), but I can compare them to not branching at all: In Gforth with dynamic superinstructions (the default if you build it yourself), you have a kind of JIT compiler that puts one code fragment after the other. If you call it with --no-super (and --ss-number=1 to disable a conflicting optimization), you get two moves and an indirect jump inserted between two consecutive fragments, with each of these jumps jumping directly to the next fragment. Here are the results for
    perf stat -e cycles:u -e instructions:u \
        -e cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken/u \
        -e cpu/event=0xc5,umask=0x20,name=br_misp_retired_near_taken/u \
        gforth-fast --ss-states=1 onebench.fs
    perf stat -e cycles:u -e instructions:u \
        -e cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken/u \
        -e cpu/event=0xc5,umask=0x20,name=br_misp_retired_near_taken/u \
        gforth-fast --ss-states=1 --no-super onebench.fs
on a Core i5-6600K (Skylake):
          default     --no-super
    1,337,788,988  1,859,774,730  cycles:u
    3,602,683,259  5,132,418,005  instructions:u
      177,471,986    646,173,626  br_inst_retired_near_taken
        5,677,645      7,353,850  br_misp_retired_near_taken
As you can see, the --no-super variant has many more (correctly predicted) indirect branches, and also more cycles, but only 1.1 additional cycles per additional indirect branch (and these cycles also cover the two additional moves per indirect branch).

Concerning BPF-kernel, Greenspun's tenth rule comes to mind.

A medley of performance-related BPF patches

Posted Jan 18, 2020 9:59 UTC (Sat) by Wol (subscriber, #4433) [Link] (1 responses)

> Actually, correctly predicted indirect branches are relatively cheap on modern CPUs.

And therein lies the problem you have missed. Predicting branches is a variant of the Spectre attack. Plus, "relatively cheap" doesn't preclude the possibility of something else being cheaper.

Indirect branches are now deprecated because (a) they are easy to use in an attack, and (b) we've found something cheaper.

Cheers,
Wol

A medley of performance-related BPF patches

Posted Jan 18, 2020 11:40 UTC (Sat) by anton (subscriber, #25547) [Link]

> Predicting branches is a variant of the Spectre attack.

More precisely, Spectre variants work by getting the program to mispredict branches, and combining that with a side channel.

> Plus, "relatively cheap" doesn't preclude the possibility of something else being cheaper.

Yes, obviously straight-line code is cheaper, as I demonstrated above. Concerning the proposed conditional branch trees, a correctly predicted taken branch takes 1 cycle on Skylake, a not-taken conditional branch can take 0 cycles (but the Skylake can only process two branches per cycle), so with all predictions correct, the tree takes at least one cycle, and with 64 targets (i.e., a 6-deep conditional branch tree), it takes 3-6 cycles. By contrast, a correctly predicted indirect branch costs 1 cycle. The claim in the original article that "indirect branches were never entirely fast" is nonsense.

If we assume a worst-case branch prediction accuracy, the indirect branch costs one misprediction (~20 cycles), while the 6-deep conditional branch tree costs three mispredictions on average (50% prediction accuracy), for a total of ~60 cycles.

It is possible that for in-between branch prediction accuracy, the conditional branch predictor may work so much better than the indirect branch predictor, that the end result is better for the tree. But I would have to see measurements before I believe it. I have read the patch announcement, but the information provided there is insufficient for me to understand what was measured, and what the numbers mean.

> Indirect branches are now deprecated because (a) they are easy to use in an attack, and (b) we've found something cheaper.

I have yet to see an instruction set manual that says that indirect branches are deprecated. As for the claimed reasons:

a) If they were easy to use in an attack, we would hear a lot about such attacks in the wild. But sure, in some environments (e.g., in the kernel), you do not want to use indirect branches and instead replace them with something else. That does not mean that indirect branches are deprecated in general.

b) No, you have not, as discussed above. But I invite you to substantiate your claim by replacing the indirect branches in Gforth with "something cheaper", and measure the result.

A medley of performance-related BPF patches

Posted Jan 25, 2020 1:31 UTC (Sat) by Shabbyx (guest, #104730) [Link]

I didn't read well enough to understand why there's a jump, but this gave me an idea.

Why not put all BPF programs attached to the same point sequentially in memory? That way, when one program finishes, the next instruction would naturally be the start of the next program. No jumps involved.

Obviously, removing BPF programs would involve a shift.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds