BPF: what's good, what's coming, and what's needed
Years ago, Miller began, Alexei Starovoitov showed up at a netfilter conference promoting his ideas for extending BPF. He described how it could be used to efficiently implement various types of switching fabric — any type, in fact. Miller said that he didn't understand the power of this idea until quite a bit later.
What's good now
BPF, he said, is well defined and useful, solving real problems and
allowing users to do what they want with their systems. It is increasingly
possible to look at what is going on inside the kernel or to modify its
behavior without having to change the source or reboot the system. BPF
provides strict limits on what programs can do, running them inside the
kernel but in a sandboxed mode. BPF programs operate on a specific
object (called the "context"), and there are many places to attach them.
They execute in finite time, something that is assured because
they are not currently allowed to contain loops (though that will change to
some extent eventually). BPF maps provide data structures for programs and
can be used to share access to data.
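To make the "context" and map ideas concrete, here is a toy model written in Python rather than real BPF. The program is an ordinary function that sees only its context and a shared map, and returns a verdict. The names `PacketContext` and `count_by_protocol`, and the verdict strings, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PacketContext:
    """Hypothetical stand-in for the context object handed to a program."""
    protocol: int   # e.g. 6 for TCP, 17 for UDP (IANA protocol numbers)
    length: int     # packet length in bytes


def count_by_protocol(ctx: PacketContext, counters: dict) -> str:
    """Toy 'program': sees only its context and a shared map; no loops."""
    counters[ctx.protocol] = counters.get(ctx.protocol, 0) + 1
    # Hypothetical verdicts; real program types define their own return codes.
    return "drop" if ctx.length == 0 else "pass"


# The map outlives each invocation, so another party can read it afterward.
counters = {}
verdicts = [count_by_protocol(PacketContext(p, n), counters)
            for p, n in [(6, 60), (17, 0), (6, 1500)]]
```

The dictionary plays the role of a map here: it is the only state that persists between invocations, and it is the channel through which results are shared.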
The BPF verifier, Miller said, is "the only thing between us and extreme peril". It is the last line of defense preventing dangerous code from getting into the system. It is so good, he said, that it often frustrates BPF authors, who have to massage their programs to get them to a point where the verifier will accept them.
The real value of BPF lies in the fact that "we are all arrogant". System designers tend to think that they know what their users want to do, so they make boxes to enable that one thing. But users don't want to be in a box; those users are a constant source of new ideas, and developers often don't know what they want. BPF allows users to escape the box created by system designers, who may be sad about that, but they'll get over it.
BPF has been growing slowly, by word of mouth, Miller said, because there is no "advertising machine" for this technology. Users are still learning about it. The good news is that, once technical people get into a new technology, they tend to spread it around. That has happened with BPF, to the point that people are now writing books about it.
What's improving
BPF is mostly useful now for solving simple problems, Miller said, but it is rapidly gaining the ability to deal with "real programs". One step in that direction is increasing the size limit for BPF programs from its current value of 4096 instructions to 1 million. The prohibition on loops forces developers to unroll loops in their programs, which is unfortunate; the in-development support for bounded loops will fix that problem someday.
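The interplay between the size limit and the loop prohibition can be sketched with a toy checker. This is not the kernel's verifier, which does far more; it assumes an invented instruction encoding and checks only the two constraints mentioned above: an instruction-count limit and the absence of backward jumps. All names (`toy_verify`, `Insn`) are hypothetical.

```python
from typing import List, Optional, Tuple

# An instruction is (opcode, jump_target); jump_target is an absolute index
# for "jmp" and None otherwise. This encoding is invented for the sketch.
Insn = Tuple[str, Optional[int]]

OLD_LIMIT = 4096        # size limit quoted in the talk
NEW_LIMIT = 1_000_000   # increased limit quoted in the talk


def toy_verify(prog: List[Insn], max_insns: int = OLD_LIMIT) -> Tuple[bool, str]:
    """Accept a program only if it is small enough and never jumps backward."""
    if not prog:
        return False, "empty program"
    if len(prog) > max_insns:
        return False, "too many instructions"
    for pc, (op, target) in enumerate(prog):
        if op != "jmp":
            continue
        if target is None or not 0 <= target < len(prog):
            return False, f"jump out of range at {pc}"
        if target <= pc:
            # A backward (or self) jump could form a loop, so refuse it.
            return False, f"backward jump at {pc}"
    if prog[-1][0] != "exit":
        return False, "program does not end with exit"
    return True, "ok"


looping = [("add", None), ("jmp", 0), ("exit", None)]
unrolled = [("add", None)] * 5000 + [("exit", None)]
```

In this model, `looping` is rejected for its backward jump, while `unrolled` (the same work written out 5000 times) is rejected under the 4096-instruction limit and accepted under the larger one. That is the trade-off that unrolling forces on developers.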
BPF programs can perform tail calls now; they function a lot like continuations. Tail calls are a great way to build an execution pipeline, where each step in the pipeline performs a tail call to the next. BPF is now able to support real function calls as well, but a given program still cannot use both due to limitations in the verifier.
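The pipeline pattern can be modeled in a few lines of Python. This is a sketch of the control flow only, not of any kernel interface: the program array is a plain dictionary, each stage names the slot to jump to next, and control never returns to the previous stage. The cap `MAX_CHAIN` and the stage names are assumptions made for the example; in this model a missing slot simply ends the chain.

```python
from typing import Callable, Dict, List, Optional

MAX_CHAIN = 8  # hypothetical cap for this sketch; not the kernel's value

# A stage takes the context and returns the slot of the next stage to
# tail-call, or None to finish. Control never returns to the caller.
Stage = Callable[[dict], Optional[int]]


def run_pipeline(prog_array: Dict[int, Stage], start: int, ctx: dict) -> List[int]:
    """Run stages by repeated tail calls; return the slots that executed."""
    executed: List[int] = []
    slot: Optional[int] = start
    while slot is not None and slot in prog_array and len(executed) < MAX_CHAIN:
        executed.append(slot)
        slot = prog_array[slot](ctx)  # next stage replaces this one; no stack growth
    return executed


def parse(ctx: dict) -> Optional[int]:
    ctx["parsed"] = True
    return 1  # continue to the filter stage


def filter_stage(ctx: dict) -> Optional[int]:
    ctx["verdict"] = "pass" if ctx["length"] > 0 else "drop"
    return 2 if ctx["verdict"] == "pass" else None


def account(ctx: dict) -> Optional[int]:
    ctx["counted"] = True
    return None


stages = {0: parse, 1: filter_stage, 2: account}
```

Calling `run_pipeline(stages, 0, {"length": 100})` runs all three stages in order, while a zero-length packet stops after the filter. The cap on chain length keeps even a stage that tail-calls itself from running forever.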
One area where BPF has seen some improvement is introspection; it can be hard for developers to understand why their program is not doing what they want. Indeed, in current kernels, it's hard even to determine which BPF programs have been loaded into the system, or to verify that a loaded program is the one that is wanted. The bpftool utility is improving in its introspection support, as is the BTF format for describing data types, which will help to increase the portability of BPF programs. BTF turns out to be good for annotating BPF programs and how they work. The perf utility is also gaining the ability to drill down into BPF programs. Users cannot complain, Miller said, that visibility into BPF is not being provided.
What's needed
There is no shortage of opportunities for improvement still, he said. For example, BPF does not currently support code reuse all that well; there are a lot of people out there writing their own Ethernet header parsers. There are systems with thousands of redundant BPF programs loaded into them; that is not the way to do software development. Support for function calls will help, but BPF needs libraries, and it will need access control for those libraries once they can be loaded. BTF will help, since it will make it easy to see which libraries are available in a given system.
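The Ethernet header parser is a good illustration of how small the duplicated code is, and thus of how natural a library candidate it would be. The sketch below is in Python rather than BPF C, handles only untagged Ethernet II frames, and uses hypothetical names (`EthHeader`, `parse_eth`).

```python
import struct
from typing import NamedTuple

ETH_HLEN = 14  # 6-byte destination + 6-byte source + 2-byte EtherType


class EthHeader(NamedTuple):
    dst: str
    src: str
    ethertype: int


def _mac(raw: bytes) -> str:
    """Format six raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_eth(frame: bytes) -> EthHeader:
    """Parse an untagged Ethernet II header; VLAN tags are not handled."""
    if len(frame) < ETH_HLEN:
        # The bounds check comes first, much as a BPF program must prove
        # that its packet accesses stay in range.
        raise ValueError("frame shorter than an Ethernet header")
    # "!" selects network (big-endian) byte order for the EtherType field.
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:ETH_HLEN])
    return EthHeader(_mac(dst), _mac(src), ethertype)


frame = bytes.fromhex("ffffffffffff" "001122334455" "0800") + b"payload"
header = parse_eth(frame)
```

Here `header.ethertype` is `0x0800`, the EtherType for IPv4. Every program that looks past the link layer needs some version of this logic, which is the redundancy Miller was describing.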
BPF development is still harder in general than it should be; Miller would like to see the development of a "type and go" environment that makes writing and loading a BPF program as easy as on the Arduino. Unskilled people should be able to get stuff done; that is part of the goal of wresting control away from arrogant system developers.
BPF programs should have "trivial debuggability", he continued. It should be possible to single-step through programs and examine context data. He would like the ability to record a program's execution or state so that it could be stepped through outside of the kernel. Perhaps even live, in-kernel single-stepping could be supported in development environments. The most important thing for the near future, though, is the ability to snapshot the current state of a BPF program.
Finally, he said, BPF needs better access control. Almost all BPF functionality is root-only now, but things will not be that way forever. Much more granular control over BPF functionality is required; alternatively, we could always control access to BPF with a BPF program, he said. A file like /dev/bpf could be used for access control, but that's still pretty coarse; perhaps what is needed is a hierarchy of files describing the different program types and their access permissions. BPF also needs better memory accounting, since maps can get quite large.
At this point, Miller concluded his talk and accepted questions. Matthew Wilcox started things off by saying that he will not be impressed by BPF until it becomes possible to play Zork in the kernel. The original Zork was less than 1 million instructions, Wilcox said, so that should be possible.
ABI compatibility
The first actual question was about whether there are any inherent limits on what BPF will eventually be able to do. Early on, Miller answered, it was used for tasks like packet analysis, and current usage still reflects that. BPF will not be usable to implement a proprietary TCP stack in the kernel, for example; that is not a goal. Among other things, there are no timers available to BPF programs and no plans to add them.
Some people do try to push the limits, Miller said. Steve Hemminger tried to convert a packet scheduler to BPF, for example, but eventually ran into the timer issue. Somebody else managed to create a complete implementation of Open vSwitch, but that sort of project really misses the point of BPF. The real value is not in doing everything, but in being able to do exactly what you need and no more.
Ted Ts'o said that he did not expect to see device drivers in BPF, but Miller responded that those already exist. He was referring to the ability to perform infrared protocol decoding in a BPF program. That eliminates the need to support hundreds of infrared devices in a kernel driver and allows support for new devices to be easily added to older kernels. Ts'o conceded that point, but said that it was unlikely that there would be an NVIDIA GPU driver written in BPF anytime soon.
Another attendee asked about ABI compatibility; will the kernel have to support existing BPF programs forever? Miller responded that BPF exists in an "ambiguous plane" between kernel ABI and the "wild west" of the kernel's internals. Tools like BTF will help to make things more compatible over time. Meanwhile, the BPF developers have taken liberties to break things in the early stages; the community is still learning how all of this stuff should work. But that should happen less often over time. That said, he doesn't think it will ever be possible to write a BPF program and expect that it will work on every future kernel.
The discussion turned to the powertop episode, where a change to a tracepoint broke the powertop utility and had to be reverted. As a result of that, some maintainers still refuse to allow the addition of tracepoints in their subsystems. The problem is that powertop was useful, so users complained when it broke. BPF programs, too, will be useful, and are likely to suffer from the same problems. Brendan Gregg may have said earlier in the week that occasional breakage was OK, but someday some user will complain and Linus Torvalds will revert a BPF-visible change. Miller responded that, whenever a new facility like this is added, there is always a period in which things break. We'll never get away from that, but it will get a lot better.
Ts'o worried about how bad the ABI pain would be; some BPF interfaces will not be changeable, he said. At least, it will not be possible to change them without a ten-year deprecation period while old programs are fixed. Miller said that, with BPF, users are often happy when things break, because it usually indicates that new information is available for them to work with.
Gregg said that, in the absence of tracepoints, current BPF tools are using a lot of kprobes. There are a lot of kernel-version checks that go with them, but they still break with every kernel release. If the kernel moves to tracepoints that only break once every five years, that will be fantastic. Ts'o wondered whether the breakage of a kprobe-based tool that is seen as being as useful as powertop would cause Torvalds to revert a change. He does not know the answer to that.
Security
Dave Hansen asked about security and side channels; BPF was one way in which the Spectre vulnerabilities could be exploited early on. These issues have been mitigated one at a time as they are found, but has any thought been given to broader mitigations? Miller acknowledged that programs can be written to exploit speculative execution vulnerabilities; the verifier can often detect and block such attempts. On the other hand, BPF can also improve security. He mentioned an episode where a bug in a custom hash computation could be exploited to crash the kernel; it was possible to move the computation to BPF and block exploits until the kernel was fixed. Hansen continued, saying that the kernel-hardening efforts are trying to address problems proactively; work in the BPF area, he said, is more reactive. Miller conceded that point, but said that, hopefully, the kernel is becoming sufficiently hardened that it will no longer be necessary to worry about these issues all the time.
The final question came from Ts'o, who wondered about how BPF will interact with Linux security modules. With the advent of stackable security modules, it should be possible to implement more flexible access control for BPF programs. He also suggested that perhaps some verifier policies should include interaction with the security-module subsystem.
Miller answered that the verifier has a set of operations specific to each
program type; it should be possible to add a security-module hook there
somehow. He also observed, with amusement, that SELinux is using classic
BPF now for a few things. It would be great, he said, to use BPF to create
new security policies; it could be the "universal security policy engine".
That would allow for the immediate addition of new policies without the
need to wait for the next kernel release.
| Index entries for this article | |
|---|---|
| Kernel | BPF |
| Conference | Storage, Filesystem, and Memory-Management Summit/2019 |
Posted May 9, 2019 18:20 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted May 9, 2019 20:44 UTC (Thu)
by roc (subscriber, #30627)
[Link] (3 responses)
Posted May 10, 2019 14:29 UTC (Fri)
by Paf (subscriber, #91811)
[Link] (2 responses)
So this seems impossible? Recording the full execution might work, but I think nothing short of that would be sufficient, because “the inputs” are more than what the user provides.
Posted May 10, 2019 16:04 UTC (Fri)
by excors (subscriber, #95769)
[Link]
For any code that's even vaguely timing-sensitive (which I assume includes nearly everything running in a kernel that's full of timers and timeouts and hardware interfaces), that's much more useful than a debugger that pauses the program while it's running.
Posted May 10, 2019 22:04 UTC (Fri)
by roc (subscriber, #30627)
[Link]
Posted May 9, 2019 22:56 UTC (Thu)
by dbkm11 (guest, #125598)
[Link]
Posted May 10, 2019 2:07 UTC (Fri)
by SMK (guest, #131799)
[Link]
This is a great roundup of the BPF session that, frankly, I've not seen anywhere else... I'm particularly interested, though, in how groups like Cilium are (or aren't) contributing to help out with the missing bits...
Posted May 10, 2019 16:04 UTC (Fri)
by Bronek (guest, #120079)
[Link]
Posted May 10, 2019 22:08 UTC (Fri)
by roc (subscriber, #30627)
[Link] (5 responses)
Posted May 10, 2019 22:34 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
After all, a million instructions is already more than some subsystems can afford.
Posted May 13, 2019 23:37 UTC (Mon)
by atai (subscriber, #10977)
[Link] (1 responses)
Posted May 14, 2019 0:30 UTC (Tue)
by pizza (subscriber, #46)
[Link]
Posted May 16, 2019 3:06 UTC (Thu)
by naptastic (guest, #60139)
[Link] (1 responses)
Posted May 16, 2019 3:43 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted May 12, 2019 5:14 UTC (Sun)
by alison (subscriber, #63752)
[Link] (1 responses)
Is there some reason why sysfs is not the obvious answer? Maybe /sys/kernel/debug/bpf, or /sys/kernel/bpf? That would certainly clarify the ABI stability question! The solution would be to keep new BPF interfaces in a staging branch until we're sure we'll keep them.
Posted May 12, 2019 19:12 UTC (Sun)
by jkowalski (guest, #131304)
[Link]
Posted May 14, 2019 9:04 UTC (Tue)
by ncm (guest, #165)
[Link]
I would like to have a range of user memory mapped, and directly accessible, to an eBPF program in such a way that eBPF code can initiate block writes to that memory without any permission checks or translation steps beyond a simple range check.
In other words, I want to pre-clear an unlimited number of writes to a mapped user buffer. In use, this will be a multi-GB ring buffer, mapped as some thousands of hugepages. The hugepages need not appear consecutive to the eBPF code, so long as I can learn at startup where each hugepage is mapped in user space.
In actual use, the eBPF code would be executing inside the NIC, and the writes would amount to DMA operations over the PCI bus.
Posted May 29, 2019 13:17 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
For those who don't remember them, the FORTRAN spec explicitly said that the index could be kept in read-only (as far as the programmer is concerned) memory. Modifying the loop index didn't necessarily affect the loop.
So a "DO II = 1 to 10" is guaranteed to execute ten times, with a monotonically increasing II, even if the code inside the loop tries to modify II. (I think some implementations actually separated the loop index from the variable so the programmer could modify II but the new version wasn't used for the loop.) You could always add a statement like "SKIP II 5", which would move 5 loops closer to termination but wouldn't move backwards.
That shouldn't be too hard to verify.
Cheers,
Ugh.
> describing the different program types and their access
> permissions.
The prohibition on loops
Wol
