Writing network flow dissectors in BPF

September 6, 2018

This article was contributed by Marta Rybczyńska

Network packet headers contain a great deal of information, but the kernel often only needs a subset of that information to be able to perform filtering or associate any given packet with a flow. The piece of code that follows the different layers of packet encapsulation to find the important data is called a flow dissector. In current Linux kernels, the flow dissector is written in C. A patch set has been proposed recently to implement it in BPF with the clear goal of improving security, flexibility, and maybe even performance.

Flow dissection in the kernel

Flow dissector usage is common in high-speed networks, but it is not limited to that use case. Information from the flow dissector may be used by a firewall or a traffic filter, or in any other situation where complete packet parsing is not necessary. For example, just the source IP address or UDP port number may be enough for the kernel to decide on the action to perform on a packet. The flow dissector extracts just the values the caller has requested from the headers; those values are called "keys". The Linux implementation of the flow dissector is not the only one; Wireshark, for example, has its own version of packet dissection.

Let us show packet dissection on an example of extracting fields from a UDP packet; in this case, it has been asked to extract two keys: the IP addresses and the UDP port numbers. The dissector starts at the Ethernet layer to check whether the packet contains an IP header directly or if there are VLANs or other encapsulations to deal with. At the IP level it will save the source and destination addresses, along with the protocol field, which determines the type of the payload protocol. Assuming that the protocol is UDP, the dissector will look for the source and destination port fields, respectively, and save their values. At that point, the requested keys have been found and the dissector's work is done. Note that the real dissector will be more complicated; it will also follow other encapsulation cases (if the UDP header does not directly follow the IP header, for example), take into account the packet fragmentation, and possibly save more keys.

Linux currently includes one built-in flow dissector; the flower classifier [PDF] is the main user. The idea of using a flow dissector based on BPF was raised back in 2017 [PDF]. Petar Penkov and Willem de Bruijn have recently proposed to add a flow dissector written in BPF for the receive path.

Switching to BPF

Using BPF for flow dissection adds a number of interesting security features. Certain types of vulnerabilities become impossible because BPF programs are guaranteed to terminate and thus won't go into an infinite loop. The submission mentioned CVE-2013-4348 as an illustration of the type of problem that can be avoided. That vulnerability consisted of the flow dissector entering an infinite loop when using IPIP encapsulation in cases where it received a packet with a tiny header length value.

Additionally, in BPF, all memory accesses are checked, so it is not possible for a program to read outside of the packet bounds. Another possible feature of the new flow dissector is allowing the administrator to disable the dissection of packets containing unused protocols, reducing the number of possible attacks. If a bug is discovered, the administrator can remove the faulty code from the dissector without a kernel recompilation or even a reboot. Or, they can add their own specialty dissector if needed.

The proposed patch adds the BPF execution in __skb_flow_dissect(); each network namespace can have its own dissector. The submission includes an example program to illustrate the concept. It already handles most of the protocols needed, including multiple levels of IP encapsulation, VLAN, and IPv4/v6 extension headers.

The possible return values of the BPF flow dissector were the subject of a discussion. Currently the program returns BPF_OK (if dissection has completed successfully) or BPF_DROP (if the dissector came to a conclusion that it is necessary to drop the packet). Song Liu asked if there should be a separate value to allow fallback to the C implementation if the BPF program encounters a protocol it does not support. Penkov answered that fallback to C would defeat the security guarantees provided by BPF. De Bruijn explained further that this idea had been discussed, but they had decided against it. The goal of the BPF flow dissector is to replace the built-in version; falling back to the C dissector would make reaching this goal harder because users would quickly come to depend on this behavior.

The location of the hook for the BPF program was also discussed. It would be possible to add it to the XDP (eXpress Data Path) hook instead, for example. The authors of the proposal did not go that way for multiple reasons. The first reason is that it would be more expensive — the XDP hook executes before GRO (generic receive offload), so the dissector would run more often. The XDP hook also runs before the SKB structure to hold the packet is allocated, and there is no easy way to move the dissected keys to the SKB afterwards. It could be possible to implement both flow dissection and GRO in a single pass, but that would require much more flow state to work.

In the first version of the submission, the definitions of the various structures used to hold flow-dissector keys were copied into the BPF program itself, since they are otherwise not visible to user space. Alexei Starovoitov noted that everything used by a BPF program becomes part of the user-space API; he suggested three solutions: moving all the key structures to the user-space headers, wrapping all of them into a separate structure and modifying the code when the internal ones change, or waiting for BTF to solve the issue. The last option was eliminated since BTF is not ready for this kind of use yet. As networking maintainer David Miller also supported the wrapping option, this is the solution the second submission used: struct bpf_flow_keys contains all of the supported keys so that they are available to the BPF program.

The first version of this patch set ran into an interesting problem: since the offset at which to start dissection in a packet is supplied by the caller, the BPF verifier cannot ensure that accesses using that offset will remain within bounds. So relatively slow accessors had to be used to get at packet data. Daniel Borkmann suggested a simple trick to get around this issue: the BPF program need only check that the offset is in bounds at the beginning; after that, the verifier can prove that subsequent uses will remain within bounds. That change improved the performance of BPF dissectors to be comparable with the in-kernel dissector.

Current state

A performance evaluation is included with the submission; it compares the BPF flow dissector with the in-kernel dissector and the no-op dissector on a UDP flow. The speed of the dissectors is similar, with the BPF one performing slightly better than the in-kernel dissector. More evaluations will probably follow, but the results are already encouraging.

The BPF flow dissector already integrates with the flower classifier since it uses the same interface. The tests included with the patch set show this integration: the demonstration drops packets from a specified UDP source port in the scenario we covered at the beginning of the article. As flower uses the flow dissector to match flows, dropping the right ones shows the right dissection.

The BPF flow dissector is another part of an increasingly BPF-based network processing model that includes XDP (covered here previously). The goal of the BPF flow dissector is ambitious: to replace the built-in one. Time will tell if it will succeed. Before that, the patch set received positive comments and it seems likely that it will be included in the near future.

Index entries for this article
Kernel	BPF/Networking
Kernel	Networking
GuestArticles	Rybczynska, Marta

Writing network flow dissectors in BPF

Posted Sep 6, 2018 22:15 UTC (Thu) by mtaht (subscriber, #11087) [Link] (11 responses)

The flow dissector is also heavily used by the various fq based qdiscs.

I still have a hard time believing an ebpf version would be even close in speed to the C version... but... cool.

Writing network flow dissectors in BPF

Posted Sep 7, 2018 9:26 UTC (Fri) by grawity (subscriber, #80596) [Link] (10 responses)

As long as it's compiled to similar machine code (iirc, ebpf is compiled on load) then there's no reason it shouldn't run at a similar speed, is there?

Writing network flow dissectors in BPF

Posted Sep 7, 2018 17:45 UTC (Fri) by mm7323 (subscriber, #87386) [Link] (9 responses)

Years of work on C compilers makes them pretty good both at higher level transformations and low level machine specific optimisation.

I'd be surprised if this could easily be beaten or quickly replicated by the eBPF compiler.

The only case I see BPF winning is if the BPF program is produced and compiled specifically for the task in hand and thus omits a lot of option checking and branches that are not used in the specific dissector. This could well be what's happening here.

Writing network flow dissectors in BPF

Posted Sep 7, 2018 19:11 UTC (Fri) by wahern (subscriber, #37304) [Link] (5 responses)

The nature of the BPF instruction set and virtual machine also makes it easier for a compiler to elide rote conditionals and branches, such as related to bounds checking, regardless of the scope and SLoC of the policy logic. This can be accomplished in C, but the language and environment have many more degrees of freedom so it requires careful and deliberate use of data structures[1] and composition[2] such that the compiler can readily ascertain and prove the variants through complex conditionals, type conversions, and deep call chains. Such careful design is normally lacking in typical C software stacks, including Linux kernel networking code, and its one of those aspects of programming (like good error handling, OOM recovery, or concurrency modeling) that demands strict discipline across the code base from the very outset.

A BPF compiler should have an easier time optimizing data access because of the stricter invariants baked into the design; and because these invariants demand greater discipline from the outset. C compilers are super sophisticated and the eBPF compiler quite naive, but we overestimate the ease of proving constraints in general purpose code and underestimate the dividends from code underpinned by strong, natural invariants.

I think this is one aspect of BPF that indisputably gives it a leg up, notwithstanding the abusive use of BPF. I've been critical of the shift to using BPF and eBPF as a green field for reinventing subsystems that are more maintainable and transparent using the normal C kernel interfaces and architecture. From a functional perspective I'm suspicious of heavy BPF usage, but I can't deny the technical potential.

[1] For example, using fixed-sized arrays...
[2] ...and passing them in a way that access bounds remain easily statically provable despite C's pointer decay semantics. (E.g. keep pointer derivations in lexically close proximity to parameter declarations and variable definitions containing the array.) Relying on the compiler to hoist bounds checking out of loops has proven spotty at best. Like JIT optimizations and performance claims, it's theoretically workable but difficult for compilers to resolve consistently. Plus, such hoisting can be done more reliably in something like BPF because there are more invariants enforced from the outset.

Writing network flow dissectors in BPF

Posted Sep 7, 2018 20:32 UTC (Fri) by Paf (subscriber, #91811) [Link]

I don't know BPF at all, but I can support the idea that it is remarkably difficult to prove constraints in C.

In general, in high performance computing situations, one reason Fortran is sometimes preferred is that it does not have pointers (in practical terms - they do exist but are very little used). A pointer in C can point *anywhere*, so unless the compiler can trace its entire lifecycle, they can make lots of otherwise relatively straightforward optimizations very hard. Pointers can also point in to the middle of what you'd like to have as unitary data types, again preventing certain optimizations. So in general, Fortran code is more easily optimized than C. (There are modified dialects of C that implement some additional constraints for this same reason (C11 has some effort in this direction, I believe).)

That specific example might not apply here, but is hopefully interesting.

Writing network flow dissectors in BPF

Posted Sep 7, 2018 22:25 UTC (Fri) by mm7323 (subscriber, #87386) [Link]

Indeed, it has often been said, and it is logically true, that higher level languages with stronger typing and more descriptive constructs should allow for better compiler optimisation and faster execution.

Often that's not the case as code can be a mess in any language- and though this may also be a situation where the C is aged and not well optimised.

The security benefits of eBPF make this initiative a good thing though. And if anything, further use of eBPF should help drive optimisation efforts in the compiler/assembler.

Writing network flow dissectors in BPF

Posted Sep 8, 2018 15:08 UTC (Sat) by mtaht (subscriber, #11087) [Link] (1 responses)

I've heard these sorts of defenses of jitted code over and over again for 30 years. Please forgive me for being dubious as to any advantages in real world applications on anything other than toy benchmarks.

Writing network flow dissectors in BPF

Posted Sep 11, 2018 17:14 UTC (Tue) by jezuch (subscriber, #52988) [Link]

Java, a JITted runtime, actually is pretty fast. What drags it down in real world workloads is garbage collection pauses. There are techniques that avoid allocation, and thus the GC, but barely anyone uses them.

And the Graal compiler available in newer OpenJDK promises to show what a JIT is really capable of, because the current compiler is so old and complex that progress in it ceased a long time ago. Early adopters of Graal report even 20% improvements, if I remember correctly.

Writing network flow dissectors in BPF

Posted Sep 11, 2018 9:09 UTC (Tue) by miquels (guest, #59247) [Link]

> [1] For example, using fixed-sized arrays...
> [2] ...and passing them in a way that access bounds remain easily statically provable despite C's pointer decay semantics.

Found this posting by Walter Bright from 2009. Too bad no C compiler has implemented something like it (AFAIK): C's Biggest Mistake.

Writing network flow dissectors in BPF

Posted Sep 8, 2018 15:03 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (1 responses)

Higher level transformations can still be done when compiling to eBPF, since they are done before instruction selection. So LLVM does the first round of optimization and emits eBPF, while the kernel does bytecode verification and x86 instruction selection. Where a regular compiler has an edge is in register allocation and target-dependent loop optimizations (optimizing induction variables, choosing addressing modes, etc.).

Writing network flow dissectors in BPF

Posted Sep 13, 2018 11:45 UTC (Thu) by nix (subscriber, #2304) [Link]

Are loops permitted in eBPF now? I mean, there's no reason loops that the verifier can prove terminate might not exist in eBPF, but last I knew they were still prohibited (in check_cfg()). (... I just checked, and that's still true in master, and in -next.)

In the absence of loops I don't see that target-dependent loop optimizations are going to be particularly relevant (except, I suppose, for complete unrolling.)

Writing network flow dissectors in BPF

Posted Sep 11, 2018 17:06 UTC (Tue) by jezuch (subscriber, #52988) [Link]

> Years of work on C compilers makes them pretty good both at higher level transformations and low level machine specific optimisation.

A C compiler is like a cow. It's big and complex (how many stomachs?) because it has to be if it wants to stand a chance extracting anything of value from the extremely low quality food it takes.

Assembler, on the other hand, is like a dog. It's much simpler because transition from meat to meat is kind of easy.

But C? Nope. It has to work *hard*.

Writing network flow dissectors in BPF

Posted Sep 12, 2018 11:27 UTC (Wed) by dbkm11 (guest, #125598) [Link]

"Linux currently includes one built-in flow dissector; the flower classifier [PDF] is the main user." It's one user but /not/ the main one. Calculating the skb->hash is done using the flow dissector and this is done for every packet in the stack, and its something that cannot be compiled out, hence the target to reduce attack surface there.