Modernizing BPF for the next 10 years
BPF was first generalized beyond packet filtering more than a decade ago. In that time, it has changed a lot, becoming much more capable. Alexei Starovoitov kicked off the second day of the BPF track at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit by leading a session discussing which changes to BPF are going to come in the next ten years as it continues evolving. He proposed several ideas, including expanding the number of registers available to BPF programs, dynamic deadlock detection, and relaxing some existing limits of the verifier.
Starovoitov started with a recap of the last ten years of BPF development. BPF's initial use case was for networking — hence the name "Berkeley Packet Filter". In 2015, this expanded into a new generation of tools, including using it for tracing. Everything in BPF has evolved for a reason, he said. Once support for BPF existed in the kernel, user-space tools like katran and Cilium started popping up to take advantage of it.
![Alexei Starovoitov [Alexei Starovoitov]](https://static.lwn.net/images/2024/alexei-starovoitov-small.png)
As tracing tools evolved, they needed access to kernel headers in order to understand how the data they were seeing was laid out. This led to things like the BPF Type Format (BTF), "compile once — run everywhere" (CO-RE), and teaching Clang how to do symbolic field accesses instead of using a known offset, he explained. Other BPF usability features were also introduced by necessity.
In C, the natural representation of global variables is to store them in .data sections (or .bss, .rodata, etc.). For BPF programs, however, it often makes sense to share global variables with user space for configuration or reporting purposes, which means that data needs to be stored in a BPF map. Even if the globals are stored in a BPF map, however, userspace still might not know how they are arranged. The solution is BPF's skeleton support that generates types that match how the compiler lays out global variables, making them easier to access programmatically.
The way BPF code is written has also changed a lot. Initially, the verifier had no support for function calls — meaning that programmers had to annotate all of their functions with always_inline. That is no longer necessary, but many users don't seem to know that because of the variety of old examples floating around the internet. "I feel we need to do something in terms of evangelizing the better practices," Starovoitov said. Loops are in a similar situation; they were not supported initially, but over time have become increasingly usable. Yet there are still patches submitted to the mailing list today that use bpf_loop(), even though open-coded iterators (a better replacement) were added two years ago.
The interfaces used by BPF code have also seen some changes. The initial mechanism for BPF to call into kernel code was helper functions. These are considered a stable interface, so it's hard to experiment with new possibilities. They've been supplemented by kfuncs, which are an unstable interface that can be augmented from anywhere, including kernel modules. At the time of Starovoitov's talk, there were 211 helper functions and 164 kfuncs in the kernel.
For kernel code calling into BPF, the same flexibility is offered by the struct_ops mechanism. The feature was first introduced for sched_ext, but has turned out to have many users across the kernel. Implementing the feature did take some time, Starovoitov said, because the necessary translation to and from the BPF calling convention is not trivial. But overall, struct_ops has been an incredibly useful feature.
It is also not the only change to BPF driven by sched_ext; the necessary integration with the scheduler means it drove many features that improve data sharing between the kernel and BPF programs. The most notable is perhaps kptrs, which are direct pointers to kernel data structures that rely on the BPF verifier to track ownership and lifetime information.
The general trend has, of course, been toward making BPF programs more capable. The most recent changes that Starovoitov discussed were BPF arenas and cond_break, which represent a big step toward BPF being able to implement arbitrary algorithms and data structures. Adding these means that extending BPF no longer needs as many kernel changes, and it also turns a lot of static verification into runtime verification, Starovoitov said. Now that these are in place, there will start to be more BPF libraries. Libraries for regular expressions and hash tables already exist — a string manipulation library is probably next.
The future
Right now, library code is reused between BPF programs using the oldest library-management technique: copy-and-paste. That needs to change, he said. BPF developers need some way to distribute shared BPF code as libraries — ideally, an approach modeled on Rust or Python, where libraries are distributed as source only. C and C++ don't really handle dependency management well, and BPF should learn from their mistakes.
Not every desirable library can be written in BPF today; there are some additional enhancements necessary to enable truly arbitrary algorithms. sched_ext can't do everything users might want — notably, reimplementing EEVDF (the current default Linux scheduler) — because of the current limits on locks. Only one lock can be held at a time, and the program cannot call kfuncs while it is held. Worse, bugs in the verifier code that ensures this property could potentially cause deadlocks. What BPF needs, Starovoitov asserted, is another line of defense, so that locks taken by BPF programs (or the core infrastructure of the BPF VM) can't interfere with the kernel. Once that is possible, run-time deadlock detection can be implemented, and then it will become possible to relax the restrictions the verifier currently puts on locks.
Many people see relaxing the verifier's requirements as something necessary to make BPF Turing-complete, but, contrary to popular belief, it already is, he said. BPF arenas are almost the last feature required to demonstrate this fact by implementing an interpreter in BPF. The only missing part is support for jump tables and indirect gotos. Writing an interpreter in BPF is a "toy motivation" that he does not actually expect anyone to want to do, Starovoitov clarified, but jump tables are something BPF will need to remain relevant.
Related to indirect gotos is support for tail calls. BPF does already have bpf_tail_call(), but Starovoitov called that a hack, saying it was cumbersome to use. A cleaner solution would be to use a dedicated instruction for indirect calls — which BPF has actually had since the beginning, as long as the call targets are global functions. The missing component is verifier support; the verifier needs to be changed to understand the concept of "the address of the instruction".
Efficiency
Even if there is little else needed to make BPF more flexible, there are plenty of things that could make it more efficient, such as dedicated bit-manipulation instructions or changes to the calling convention, Starovoitov explained. In particular, a function attribute that can be used to mark functions that don't need to use caller-saved registers could let compilers be smarter about register allocation and reduce the number of registers spilled to the stack. Another possible idea for better register allocation is just to increase the number of available registers. When BPF was first being developed, x86 was the dominant architecture under consideration; it does not offer many registers compared to other architectures. For modern BPF, the Arm and RISC-V architectures are important contenders — but BPF programs can't take advantage of the larger number of available registers.
Starovoitov mentioned a few ways that the BPF developers could approach the problem, such as switching to virtual registers and doing register allocation in the kernel. Other possibilities include rejecting programs that use too many registers for a given architecture, fat binaries compiled for different numbers of registers, or having the verifier track spills to the stack and convert them to registers when possible. David Vernet pointed out that BPF's instruction encoding only has 4 bits for registers, so using more than 16 registers would be difficult. Starovoitov replied that new instructions could potentially add more space for registers, noting that he feels that the restriction on the number of registers is showing its age.
Increasing the number of available registers could also open the door to passing six or more arguments. Right now, BPF has a maximum of five arguments per function, because only five registers are used for passing arguments. That could be worked around by using the stack, but extra registers would be an easier solution. The key constraint around changing the BPF calling convention is making sure it can be efficiently mapped to the kernel's calling convention, so the right solution is not immediately obvious.
More uses
Starovoitov mentioned one final pie-in-the-sky idea for better interoperability between BPF and the kernel: with extra registers and an expanded calling convention, it might be feasible to compile the kernel to the BPF ISA. That would open the door to a number of previously unimagined applications, such as using BPF debugging across the whole kernel.
There are other more feasible improvements, though, such as permitting alloca(). BPF programs are currently restricted to a stack of 512 bytes, which makes using alloca() impractical, since there is not much space available for allocations. While it may be possible to expand the size of the BPF stack, another solution is to use a "divided stack" that tracks return addresses and local variables on separate stacks. The 512-byte stack that the kernel is aware of could then be reserved purely for function calls, with a larger stack allocated (perhaps in a BPF arena) on the fly.
Vernet questioned how desirable alloca() was in BPF programs, noting that it creates a bunch of extra instructions compared to using a static allocation — and, in fact, alloca() is forbidden in the kernel for that and other reasons. Starovoitov acknowledged that it was forbidden in the kernel, but didn't think that was relevant to BPF programs. alloca() is much cheaper than a heap allocation, and can be guaranteed to succeed. BPF programs might find it useful for holding structures that vary in size depending on the size of kernel structures, a complication that comes up fairly frequently.
Of course, all of this future flexibility and dynamism will come at a cost. "Not everything can be done statically," Starovoitov said. Making BPF programs safely cancellable, with run-time timeouts to augment program verification, will probably become necessary. Starovoitov said that work was already in progress. Debugging all of these new features is not likely to become a problem, however, because BPF already has good observability. BTF debugging info is a good match for C code (and kernel code), and existing tools like bpftrace use it to great effect. One thing that is missing is letting these same observability tools work in user space. Starovoitov noted that it is a chicken-and-egg problem: user-mode BPF probes are not fast, so why bother supporting them? But they won't become fast without additional support. Also, many languages used in user space don't match BTF semantics as well, which may require changes to the format.
A potential side effect of making BPF more capable is making more work for the verifier. Currently, the verifier has a one-million-instruction limit, just to bound the amount of time it will spend on a pathological BPF program. For large programs, however, any verifier or compiler change could potentially make it hit the limit, which is a "horrible user experience". There is no real solution yet, Starovoitov said, but the problem is something that BPF developers must consciously focus on in order to fix. There are some workarounds, such as testing compiler changes in the BPF CI. Another possible solution would be to relax the limit, and let the verifier go beyond one million instructions if it can tell that it is making forward progress.
The last idea Starovoitov introduced was making BPF into a kernel module. He noted that the main problem for vendors shipping products that need BPF is the variety of different kernels in use; it is difficult to make a BPF program work across all of them. One potential solution, he explained, would be to make it possible to upgrade the BPF subsystem independently of the base kernel. That would, itself, be a significant challenge — the existing kernel module mechanism is not enough to support it — but it could solve some persistent problems.
Questions
Starovoitov finished by saying that he thought BPF's next growth areas were likely to be in Linux security modules (LSMs), other security use cases, and continuing improvements to sched_ext. The audience had several questions. One person asked whether Starovoitov thought BPF was harder or easier to use now, after a decade of changes. Starovoitov replied that the BPF developers have made it much easier to use, but that they couldn't actually simplify some of the core design behind BPF because of stability guarantees, so it wasn't as easy to use as it could be. Another person asked whether Starovoitov expected BPF to see growth outside the kernel. Starovoitov replied that the main power of BPF is in observability and safety. There may be use cases that call for that combination outside the kernel, but user space as a whole would not benefit from BPF, he said.
Another member of the audience asked about how to communicate all of this to the many BPF users who were not present at LSFMM+BPF; they suggested doing outreach to other conferences to help spread some of these ideas. Starovoitov said that he agreed completely that this was a good idea, but that he is not specialized in doing evangelism. He called on the other people present in the session to help spread the word.
BPF has grown remarkably quickly, and there's no particular sign that it will stop. Many users are finding value in the greater observability and configurability it brings to the kernel. At the same time, it is clear that there are still big plans to change BPF. It may look quite different in another decade.
| Index entries for this article | |
|---|---|
| Kernel | BPF |
| Conference | Storage, Filesystem, Memory-Management and BPF Summit/2024 |
Posted Jun 7, 2024 16:00 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (8 responses)
So... Why not WASM?
Like really. At this point, you've abandoned ALL the original reasons for BPF.
Posted Jun 7, 2024 22:12 UTC (Fri)
by python (guest, #171317)
[Link] (2 responses)
I find the path that BPF is taking to be a bit baffling. I feel like they are making a systematic effort to reinvent the wheel (admittedly their wheel is very well suited for certain specific purposes). It feels like BPF programs are becoming something akin to kernel modules, but wrapped in a special/quirky permission-protected half-bpf-userspace-half-kernel thingy.
WASM, or just running (JIT?) compiled native instructions in a secure or virtualized environment seems like it would be a lot easier. If they wanted the BPF programs to have certain properties (like, not getting stuck in an infinite loop) it seems like the compiler* ought to be responsible for analyzing and enforcing that not the BPF instruction set.
*a cross between something like the TLA+ theorem prover and a standard compiler. Rust does something like this for analyzing and enforcing proper memory access. But it easily could be done for other program properties.
Posted Jun 8, 2024 18:15 UTC (Sat)
by edeloget (subscriber, #88392)
[Link] (1 responses)
I think I said this numerous times, but the kernel will still have to verify the program when it loads it. You do not want to allow arbitrary programs to run in your kernel without proper checking first. Just like the web backend shall verify the user input, even if it's checked by the JS on the client side. It does not matter if you execute your web server in a super controlled environment, letting attacker-controlled values dance with your backend is not a good idea :)
On the more general WASM vs BPF discussion, I have the feeling that the two serve different purposes using similar technology and techniques. eBPF evolved from BPF (which, as I understand it, predates WASM by 'some time') and was devised by network-minded people who wanted to enhance the possibilities of the venerable filtering language. WASM was clearly not designed with this in mind. Up to very recently, eBPF was significantly inferior to WASM in terms of VM(1) possibilities, but on the other hand eBPF was really a good fit for the kernel itself. The recent enhancements of eBPF make WASM look like a good replacement, but pushing WASM into the kernel would mean:
* to replace the eBPF VM in the kernel with the WASM VM, and to replace all the eBPF call sites with WASM equivalents, which is going to be a daunting task; while technically it seems to be feasible, no one has ever checked this, and there is a good possibility that some features offered by eBPF will not be implementable with an in-kernel WASM VM.
Doing this would likely require multiple kernel release cycles, and in the end you'll get the same features as now. The value of such a change is limited (and I'm pretty sure that many of you would find it quite ridiculous to see this development U-turn).
If you don't *replace* the VM but only add the WASM one, you'll have to maintain two similar VMs in the kernel -- that does not look like a good idea to me (and the WASM VM would need years before it reached a state where it could compete with the eBPF VM).
* to tell everyone to stop using eBPF; that looks like it's going to be easier, but let me introduce you to one of my friends, Cobol.
In the end, does it really matter which VM is used? The kernel developers are doing quite an incredible job with eBPF support, so WASM, at this point, would add nothing of value -- or, at least, too little to be really of interest. We also have compilers that spit out eBPF byte code, so we're not in need of "having to know yet another language".
(1) I use the term VM quite loosely here; I just mean: "whatever is used to execute the eBPF or WASM code as well as its controlled environment".
Posted Jun 8, 2024 18:35 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link]
First, it's not needed. eBPF can stay in its current roles and for legacy stuff like the actual packet filter. And eventually you can just run eBPF on top of WASM. This is perfectly doable right now, there's nothing special in eBPF.
> In the end, does it really matter which VM is used?
Kinda yes. eBPF is becoming a generic VM, but built on top of a hacky NIH-ed instruction set that is a very poor fit for general purpose programming. And we'll be forever saddled with it, and all of its eventual bugs and misfeatures.
Posted Jun 7, 2024 23:52 UTC (Fri)
by roc (subscriber, #30627)
[Link] (4 responses)
Posted Jun 8, 2024 2:25 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
At this point there's no conceptual difference between WASM and eBPF. The eBPF runtime tries to be more resilient to timing attacks (via constant blinding), but that's a property of the runtime. It can be added to WASM runtimes.
Posted Jun 8, 2024 11:15 UTC (Sat)
by aszs (subscriber, #50252)
[Link]
Posted Jun 10, 2024 7:31 UTC (Mon)
by roc (subscriber, #30627)
[Link] (1 responses)
I think the closest WASM equivalent of this is reference types, which are pretty new. I don't see instructions for doing field load/stores via references yet.
Posted Jun 10, 2024 17:11 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
I'm also going to bet that eBPF will grow pointer arithmetic that can re-interpret types in the future.
See, for example, https://dl.acm.org/doi/10.1145/3571208:
"memory-unsafe C code remains unsafe when compiled to Wasm—and attackers can exploit buffer overflows and use-after-frees in Wasm almost as easily as they can on native platforms."
(And I just noticed one of that paper's authors is also a contributor to the Wasm spec)
It's not well documented and all very ad-hoc, but this is not the usual WASM array-of-bytes model.