
BPF: what's good, what's coming, and what's needed

By Jonathan Corbet
May 9, 2019

LSFMM
The 2019 Linux Storage, Filesystem, and Memory-Management Summit differed somewhat from its predecessors in that it contained a fourth track dedicated to the BPF virtual machine. LWN was unable to attend most of those sessions, but a couple of BPF-related talks were a part of the broader program. Among those was a plenary talk by Dave Miller, described as "a wholistic view" of why BPF is successful, its current state, and where things are going.

Years ago, Miller began, Alexei Starovoitov showed up at a netfilter conference promoting his ideas for extending BPF. He described how it could be used to efficiently implement various types of switching fabric — any type, in fact. Miller said that he didn't understand the power of this idea until quite a bit later.

What's good now

BPF, he said, is well defined and useful, solving real problems and allowing users to do what they want with their systems. It is increasingly possible to look at what is going on inside the kernel or to modify its behavior without having to change the source or reboot the system. BPF provides strict limits on what programs can do, running them inside the kernel but in a sandboxed mode. BPF programs operate on a specific object (called the "context"), and there are many places to attach them. They execute in finite time, something that is assured because they are not currently allowed to contain loops (though that will change to some extent eventually). BPF maps provide data structures for programs and can be used to share access to data.
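To see why a ban on loops implies finite execution time, consider a toy model (not the kernel's verifier, which does far more). Instructions here are hypothetical (opcode, offset) pairs, and the function name is invented for illustration. The one detail borrowed from real BPF is that a jump offset is relative to the following instruction.

```python
# Toy model only: each instruction is a hypothetical (opcode, jump_offset) pair.
# A jump at index i with offset off lands on i + 1 + off, as BPF jump offsets
# are relative to the next instruction.
JUMP_OPCODES = ("ja", "jeq")


def has_only_forward_jumps(program):
    """Return True if every jump lands on a later, in-bounds instruction."""
    for index, (opcode, offset) in enumerate(program):
        if opcode not in JUMP_OPCODES:
            continue
        target = index + 1 + offset
        if target <= index or target >= len(program):
            return False  # backward jump (a potential loop) or out of bounds
    return True


straight = [("mov", 0), ("jeq", 1), ("mov", 0), ("exit", 0)]
looping = [("mov", 0), ("add", 0), ("ja", -2), ("exit", 0)]

print(has_only_forward_jumps(straight))
print(has_only_forward_jumps(looping))
```

If every jump moves strictly forward, each instruction can run at most once, so the program length is itself an upper bound on the number of steps executed.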

The BPF verifier, Miller said, is "the only thing between us and extreme peril". It is the last line of defense preventing dangerous code from getting into the system. It is so good, he said, that it often frustrates BPF authors, who have to massage their programs to get them to a point where the verifier will accept them.

The real value of BPF lies in the fact that "we are all arrogant". System designers tend to think that they know what their users want to do, so they make boxes to enable that one thing. But users don't want to be in a box; those users are a constant source of new ideas, and developers often don't know what they want. BPF allows users to escape the box created by system designers, who may be sad about that, but they'll get over it.

BPF has been growing slowly, by word of mouth, Miller said, because there is no "advertising machine" for this technology. Users are still learning about it. The good news is that, once technical people get into a new technology, they tend to spread it around. That has happened with BPF, to the point that people are now writing books about it.

What's improving

BPF is mostly useful now for solving simple problems, Miller said, but it is rapidly gaining the ability to deal with "real programs". One step in that direction is increasing the size limit for BPF programs from its current value of 4096 instructions to 1 million. The prohibition on loops forces developers to unroll loops in their programs, which is unfortunate; the in-development support for bounded loops will fix that problem someday.
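The cost of unrolling can be illustrated with a small sketch. The pseudo-instructions below are made-up text labels rather than real BPF, and the helper names are hypothetical; only the two limits come from the talk. In practice the unrolling is usually left to the compiler, but the arithmetic is the same.

```python
OLD_INSN_LIMIT = 4096       # limit cited in the talk
NEW_INSN_LIMIT = 1_000_000  # raised limit cited in the talk


def unroll(body, trip_count):
    """Repeat a loop body trip_count times as straight-line pseudo-code."""
    unrolled = []
    for iteration in range(trip_count):
        for insn in body:
            # "{i}" in a pseudo-instruction is replaced by the iteration number.
            unrolled.append(insn.format(i=iteration))
    return unrolled


def fits(program, limit):
    """Check a program's length against an instruction limit."""
    return len(program) <= limit


# Hypothetical two-instruction loop body that sums packet bytes.
body = ["r1 = load(pkt + {i})", "r0 += r1"]
program = unroll(body, 3)
for insn in program:
    print(insn)
```

With a two-instruction body, 2048 iterations fill the old limit exactly and one more iteration exceeds it, which is why unrolled programs ran out of room so quickly.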

BPF programs can perform tail calls now; they function a lot like continuations. Tail calls are a great way to build an execution pipeline, where each step in the pipeline performs a tail call to the next. BPF is now able to support real function calls as well, but a given program still cannot use both due to limitations in the verifier.
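The pipeline idea can be modeled in user space as follows. This is a simplified sketch, not kernel code: the program array is a plain dictionary, the stage names are invented, and the cap on chain length is an assumed value chosen for illustration. Each stage never returns to its predecessor; it only names the slot that should run next.

```python
MAX_CHAIN = 32  # assumed cap on chained stages, for illustration only


def run_pipeline(prog_array, start, ctx, max_chain=MAX_CHAIN):
    """Run stages from slot `start`, following each stage's choice of next slot."""
    slot = start
    for _ in range(max_chain):
        stage = prog_array.get(slot)
        if stage is None:
            break          # empty slot: nothing to jump to
        slot = stage(ctx)  # the stage updates ctx and names the next slot
        if slot is None:
            break          # the stage finished instead of chaining onward
    return ctx


def parse(ctx):
    ctx["trace"].append("parse")
    return 1


def classify(ctx):
    ctx["trace"].append("classify")
    return 2


def forward(ctx):
    ctx["trace"].append("forward")
    return None


prog_array = {0: parse, 1: classify, 2: forward}
result = run_pipeline(prog_array, 0, {"trace": []})
print(result["trace"])
```

Because stages are looked up by slot at run time, one stage can be swapped for another without touching the rest of the pipeline, and the cap keeps a chain that points back at itself from running forever.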

One area where BPF has seen some improvement is introspection; it can be hard for developers to understand why their program is not doing what they want. Indeed, in current kernels, it's hard even to determine which BPF programs have been loaded into the system, or to verify that a loaded program is the one that is wanted. The bpftool utility is improving in its introspection support, as is the BTF format for describing data types, which will help to increase the portability of BPF programs. BTF turns out to be good for annotating BPF programs and how they work. The perf utility is also gaining the ability to drill down into BPF programs. Users cannot complain, Miller said, that visibility into BPF is not being provided.

What's needed

There is no shortage of opportunities for improvement still, he said. For example, BPF does not currently support code reuse all that well; there are a lot of people out there writing their own Ethernet header parsers. There are systems with thousands of redundant BPF programs loaded into them; that is not the way to do software development. Support for function calls will help, but BPF needs libraries, and it will need access control for those libraries once they can be loaded. BTF will help, since it will make it easy to see which libraries are available in a given system.

BPF development is still harder in general than it should be; Miller would like to see the development of a "type and go" environment that makes writing and loading a BPF program as easy as on the Arduino. Unskilled people should be able to get stuff done; that is part of the goal of wresting control away from arrogant system developers.

BPF programs should have "trivial debuggability", he continued. It should be possible to single-step through programs and examine context data. He would like the ability to record a program's execution or state so that it could be stepped through outside of the kernel. Perhaps even live, in-kernel single-stepping could be supported in development environments. The most important thing for the near future, though, is the ability to snapshot the current state of a BPF program.
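One way to picture recording a program's execution for later stepping is the following user-space sketch. It assumes that the program is deterministic apart from the values it reads from its environment; the `program`, `record`, and `replay` names and the `read_kernel` callback are all hypothetical stand-ins, not an existing BPF interface.

```python
def program(ctx, read_kernel):
    """Stand-in for a BPF program: pure except for calls to read_kernel."""
    pid = read_kernel("pid")
    latency = read_kernel("latency_ns")
    return {"pid": pid, "slow": latency > ctx["threshold_ns"]}


def record(ctx, read_kernel):
    """Run the program once, logging every value it reads."""
    log = []

    def recording_read(key):
        value = read_kernel(key)
        log.append((key, value))
        return value

    return program(ctx, recording_read), log


def replay(ctx, log):
    """Re-run the program offline, feeding it the logged values in order."""
    entries = iter(log)

    def replayed_read(key):
        recorded_key, value = next(entries)
        assert recorded_key == key, "replay diverged from recording"
        return value

    return program(ctx, replayed_read)


live = {"pid": 42, "latency_ns": 1500}  # pretend kernel state
ctx = {"threshold_ns": 1000}
result, log = record(ctx, live.__getitem__)

live["latency_ns"] = 10                 # the live state moves on...
print(replay(ctx, log) == result)       # ...but the replay does not depend on it
```

The replay can be single-stepped in an ordinary debugger as often as needed, since it depends only on the context and the log rather than on the running system.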

Finally, he said, BPF needs better access control. Almost all BPF functionality is root-only now, but things will not be that way forever. Much more granular control over BPF functionality is required; alternatively, he said, access to BPF could always be controlled with a BPF program. A file like /dev/bpf could be used for access control, but that's still pretty coarse; perhaps what is needed is a hierarchy of files describing the different program types and their access permissions. BPF also needs better memory accounting, since maps can get quite large.

At this point, Miller concluded his talk and accepted questions. Matthew Wilcox started things off by saying that he will not be impressed by BPF until it becomes possible to play Zork in the kernel. The original Zork was less than 1 million instructions, Wilcox said, so that should be possible.

ABI compatibility

The first actual question was about whether there are any inherent limits on what BPF will eventually be able to do. Early on, Miller answered, it was used for tasks like packet analysis, and current usage still reflects that. BPF will not be usable to implement a proprietary TCP stack in the kernel, for example; that is not a goal. Among other things, there are no timers available to BPF programs and no plans to add them.

Some people do try to push the limits, Miller said. Steve Hemminger tried to convert a packet scheduler to BPF, for example, but eventually ran into the timer issue. Somebody else managed to create a complete implementation of Open vSwitch, but that sort of project really misses the point of BPF. The real value is not in doing everything, but in being able to do exactly what you need and no more.

Ted Ts'o said that he did not expect to see device drivers in BPF, but Miller responded that those already exist. He was referring to the ability to perform infrared protocol decoding in a BPF program. That eliminates the need to support hundreds of infrared devices in a kernel driver and allows support for new devices to be easily added to older kernels. Ts'o conceded that point, but said that it was unlikely that there would be an NVIDIA GPU driver written in BPF anytime soon.

Another attendee asked about ABI compatibility; will the kernel have to support existing BPF programs forever? Miller responded that BPF exists in an "ambiguous plane" between kernel ABI and the "wild west" of the kernel's internals. Tools like BTF will help to make things more compatible over time. Meanwhile, the BPF developers have taken liberties to break things in the early stages; the community is still learning how all of this stuff should work. But that should happen less often over time. That said, he doesn't think it will ever be possible to write a BPF program and expect that it will work on every future kernel.

The discussion turned to the powertop episode, where a change to a tracepoint broke the powertop utility and had to be reverted. As a result of that, some maintainers still refuse to allow the addition of tracepoints in their subsystems. The problem is that powertop was useful, so users complained when it broke. BPF programs, too, will be useful, and are likely to suffer from the same problems. Brendan Gregg may have said earlier in the week that occasional breakage was OK, but someday some user will complain and Linus Torvalds will revert a BPF-visible change. Miller responded that, whenever a new facility like this is added, there is always a period in which things break. We'll never get away from that, but it will get a lot better.

Ts'o worried about how bad the ABI pain would be; some BPF interfaces will not be changeable, he said. At least, it will not be possible to change them without a ten-year deprecation period while old programs are fixed. Miller said that, with BPF, users are often happy when things break, because it usually indicates that new information is available for them to work with.

Gregg said that, in the absence of tracepoints, current BPF tools are using a lot of kprobes. There are a lot of kernel-version checks that go with them, but they still break with every kernel release. If the kernel moves to tracepoints that only break once every five years, that will be fantastic. Ts'o wondered whether the breakage of a kprobe-based tool that is seen as being as useful as powertop would cause Torvalds to revert a change. He does not know the answer to that.

Security

Dave Hansen asked about security and side channels; BPF was one way in which the Spectre vulnerabilities could be exploited early on. These issues have been mitigated one at a time as they are found, but has any thought been given to broader mitigations? Miller acknowledged that programs can be written to exploit speculative execution vulnerabilities; the verifier can often detect and block such attempts. On the other hand, BPF can also improve security. He mentioned an episode where a bug in a custom hash computation could be exploited to crash the kernel; it was possible to move the computation to BPF and block exploits until the kernel was fixed. Hansen continued, saying that the kernel-hardening efforts are trying to address problems proactively; work in the BPF area, he said, is more reactive. Miller conceded that point, but said that, hopefully, the kernel is becoming sufficiently hardened that it will no longer be necessary to worry about these issues all the time.

The final question came from Ts'o, who wondered about how BPF will interact with Linux security modules. With the advent of stackable security modules, it should be possible to implement more flexible access control for BPF programs. He also suggested that perhaps some verifier policies should include interaction with the security-module subsystem.

Miller answered that the verifier has a set of operations specific to each program type; it should be possible to add a security-module hook there somehow. He also observed, with amusement, that SELinux is using classic BPF now for a few things. It would be great, he said, to use BPF to create new security policies; it could be the "universal security policy engine". That would allow for the immediate addition of new policies without the need to wait for the next kernel release.


BPF: what's good, what's coming, and what's needed

Posted May 9, 2019 18:20 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> "a wholistic view"
Ugh.

BPF: what's good, what's coming, and what's needed

Posted May 9, 2019 20:44 UTC (Thu) by roc (subscriber, #30627) [Link] (3 responses)

"Live, in-kernel singlestepping" makes no sense. Just have a way to record the inputs of a BPF execution, ensure BPF execution is deterministic, and then people can replay those executions and analyze them however they want without interfering with the operation of the system.

BPF: what's good, what's coming, and what's needed

Posted May 10, 2019 14:29 UTC (Fri) by Paf (subscriber, #91811) [Link] (2 responses)

“The inputs” Hmm, BPF programs measure in-kernel state; that’s one of their main purposes. That state changes, obviously.

So this seems impossible? Recording the full execution might work, but I think nothing short of that would be sufficient, because “the inputs” are more than what the user provides.

BPF: what's good, what's coming, and what's needed

Posted May 10, 2019 16:04 UTC (Fri) by excors (subscriber, #95769) [Link]

I assume roc is thinking of https://rr-project.org/ , so "the inputs" means all data the program receives from its environment through any channel. For BPF I guess that means the initial context structure, plus the return values and BPF-memory side-effects of functions like bpf_probe_read (which copies kernel memory into BPF memory), but nothing more than that. Once you record those inputs, you should be able to deterministically replay the BPF program in a debug environment with identical behaviour.

For any code that's even vaguely timing-sensitive (which I assume includes nearly everything running in a kernel that's full of timers and timeouts and hardware interfaces), that's much more useful than a debugger that pauses the program while it's running.

BPF: what's good, what's coming, and what's needed

Posted May 10, 2019 22:04 UTC (Fri) by roc (subscriber, #30627) [Link]

For BPF "inputs" includes loads from memory outside the BPF program.

BPF: what's good, what's coming, and what's needed

Posted May 9, 2019 22:56 UTC (Thu) by dbkm11 (guest, #125598) [Link]

Jon, just a note for clarification with regard to ABI compatibility: the context for the "ambiguous plane" between kernel ABI and the "wild west" of the kernel's internals is tracing, where things can occasionally break due to the nature of kprobes, tracepoints potentially being removed, and so on. BTF will help here, for example, since a loader or the verifier can figure out the structure layout for a running kernel and adjust member offsets automatically, to a certain degree, so that in the future a tracing program can run on different kernels without recompilation. For other types, like networking programs, the same rules apply as for the syscall ABI, meaning that programs will keep running on newer kernels. Perhaps this can be clarified a bit better.

BPF: what's good, what's coming, and what's needed

Posted May 10, 2019 2:07 UTC (Fri) by SMK (guest, #131799) [Link]

(nearly) not a day goes by when i don't hear of one vendor or another telling me why BPF is important.

This is a great roundup of BPF session, that frankly i've not seen anywhere else...i'm particularly interested though in how groups like Cilium are (or aren't) contributing to help out with the missing bits...

BPF: what's good, what's coming, and what's needed

Posted May 10, 2019 16:04 UTC (Fri) by Bronek (guest, #120079) [Link]

Perhaps a non-Turing complete, simple language, such as Starlark might be a good high-level addition for writing BPF programs.

BPF: what's good, what's coming, and what's needed

Posted May 10, 2019 22:08 UTC (Fri) by roc (subscriber, #30627) [Link] (5 responses)

Has anyone involved got a vision for what BPF will look like in ten years? Is it going to be a full-fledged bytecode and VM for executing arbitrary code, i.e. like WebAssembly with some extensions and some restricted modes? If so, in ten years are people going to be asking why both BPF and WebAssembly need to exist?

BPF: what's good, what's coming, and what's needed

Posted May 10, 2019 22:34 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

The max number of instructions has been increased to 10^6. So I predict we'll get a backbranch instruction in about a year.

After all, a million instructions is already more than some subsystems can afford.

BPF: what's good, what's coming, and what's needed

Posted May 13, 2019 23:37 UTC (Mon) by atai (subscriber, #10977) [Link] (1 responses)

when will the kernel be written in BPF?

BPF: what's good, what's coming, and what's needed

Posted May 14, 2019 0:30 UTC (Tue) by pizza (subscriber, #46) [Link]

I think we'll need a BPF compiler written in BPF first, as well as a BPF->C converter so the kernel can bootstrap itself..

BPF: what's good, what's coming, and what's needed

Posted May 16, 2019 3:06 UTC (Thu) by naptastic (guest, #60139) [Link] (1 responses)

What's a "backbranch" instruction? I thought I knew all the flow control primitives...

BPF: what's good, what's coming, and what's needed

Posted May 16, 2019 3:43 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Jump back if condition is fulfilled. It means that you can make loops with arbitrary number of iterations.

BPF: what's good, what's coming, and what's needed

Posted May 12, 2019 5:14 UTC (Sun) by alison (subscriber, #63752) [Link] (1 responses)

> A file like /dev/bpf could be used for access control, but that's still
> pretty coarse; perhaps what is needed is a hierarchy of files
> describing the different program types and their access
> permissions.

Is there some reason why sysfs is not the obvious answer? Maybe /sys/kernel/debug/bpf, or /sys/kernel/bpf? That would certainly clarify the ABI stability question! The solution would be to keep new BPF interfaces in a staging branch until we're sure we'll keep them.

BPF: what's good, what's coming, and what's needed

Posted May 12, 2019 19:12 UTC (Sun) by jkowalski (guest, #131304) [Link]

Yeah, either way a hierarchy of objects instead of a device node (and then having to use ioctl to look up an object by name or something uglier) is a much nicer model (and one could argue it should have been used for a lot of other kernel interfaces), as it helps to use the descriptor to select the object in question, and access control can be done through the filesystem itself.

BPF: what's good, what's coming, and what's needed

Posted May 14, 2019 9:04 UTC (Tue) by ncm (guest, #165) [Link]

There is one thing I want added to the BPF library.

I would like to have a range of user memory mapped, and directly accessible, to an eBPF program in such a way that eBPF code can initiate block writes to that memory without any permission checks or translation steps beyond a simple range check.

In other words, I want to pre-clear an unlimited number of writes to a mapped user buffer. In use, this will be a multi-GB ring buffer, mapped as some thousands of hugepages. The hugepages need not appear consecutive to the eBPF code, so long as I can learn at startup where each hugepage is mapped in user space.

In actual use, the eBPF code would be executing inside the NIC, and the writes would amount to DMA operations over the PCI bus.

The prohibition on loops

Posted May 29, 2019 13:17 UTC (Wed) by Wol (subscriber, #4433) [Link]

What about a FORTRAN-style loop?

For those who don't remember them, the FORTRAN spec explicitly said that the index could be kept in read-only (as far as the programmer is concerned) memory. Modifying the loop index didn't necessarily affect the loop.

So a "DO II = 1 to 10" is guaranteed to execute ten times, with a monotonically increasing II, even if the code inside the loop tries to modify II. (I think some implementations actually separated the loop index from the variable so the programmer could modify II but the new version wasn't used for the loop.) You could always add a statement like "SKIP II 5", which would move 5 loops closer to termination but wouldn't move backwards.

That shouldn't be too hard to verify.

Cheers,
Wol


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds