A possible path for cancelable BPF programs
The Linux kernel supports attaching BPF programs to many operations. This is generally safe because the BPF verifier ensures that BPF programs can't misuse kernel resources, run indefinitely, or otherwise escape their boundaries. There is continuing tension, however, between trying to expand the capabilities of BPF programs and ensuring that the verifier can handle every edge case. On February 14, Juntong Deng shared a proof-of-concept patch set that adds some run-time checks to BPF to make it possible in the future to interrupt a running BPF program.
When initially conceived, BPF had strict limits on the number of instructions a program could contain, and did not permit loops, limiting how long a program could run. This is important because the kernel will call BPF hooks during many time-sensitive operations, so a misbehaving BPF program that managed to run for too long could potentially cause kernel hangs or other problems. Over time, those limits have been gradually expanded for two reasons: the verifier has become more capable — handling loops, more complicated functions, etc. — and developers have discovered that complicated, long-running BPF programs are quite useful, prompting them to ask for the limits to be loosened.
Theoretically, BPF could have some kind of watchdog system to kill programs that take too long at run time, instead of trying to bound the run time of the program statically. This would have some advantages, such as allowing for programs with complex control flow that the verifier cannot currently understand. The major problem with this idea is that BPF programs can hold kernel resources, such as locks, and therefore killing them is not as simple as just interrupting their execution.
Deng's patch set tackles this problem by adding relatively low-overhead tracking of acquired kernel resources. With dynamic tracking, the process of killing a BPF program gets much simpler, because the kernel can release any resources held by the program at any time. The patch set doesn't implement the actual watchdog and associated killing of BPF programs itself, yet. Instead, it is more of a proof of concept to show that such a thing could be made to work with BPF's use of kernel resources.
Deng is also clear that even once the patch set is ready, it will still just be an intermediate step. The verifier should continue being improved until it can track which resources need to be cleaned up at each point in the program statically:
Note that this patch series is not intended to replace pre-runtime/post-runtime efforts, and having no runtime overhead is always better than having some. Our final goal is to have no runtime overhead. In the future, as these no-runtime-overhead solutions mature, the runtime-overhead solutions can be disabled.
This is similar to the relationship between BPF JIT and BPF interpreter. We always know that JIT is better and should be used eventually, but the interpreter is a not too bad alternative when JIT is not ready or cannot be used, and can help us support some features faster.
The details
Any solution needs to perform well, given how prevalent BPF programs are in performance-sensitive parts of the kernel. In this case, that means avoiding dynamic allocation for resource tracking. Luckily, the verifier already tracks when BPF programs acquire or release resources. Deng's patch set adds code to check the maximum number of resources that a program holds simultaneously, and allocates a static table to hold resource information. The table holds a series of slots, which are initialized to each hold a pointer to the next one, forming a linked list of free slots.
That table needs to not only hold pointers to the resources in question, it also needs to store the type of the resource. BPF programs can hold several different types of kernel resources, which all need to be acquired and released using specific functions. For example, bpf_task_from_pid() is used to acquire a reference to a task_struct, which is released with bpf_task_release().
In order to look up the correct release function based on the type of the resource, Deng's patch builds a table of types. Theoretically, when a BPF program is killed, the code can do binary search in the table to find the resource type, and thus its release function. Most of the time, however, the BPF program will release the resource normally. When this happens, the kernel needs to find the entry in the resource table for the resource and add it to the list of free slots. This is done using a fixed-size hash table so that the lookup is fast.
Overall, Deng's patch adds a small amount of run-time overhead to the process of acquiring and releasing resources. When acquiring, the code needs to look up the resource type using binary search, pop a slot off the free slots list, and insert an entry into the hash table. When releasing, the code needs to look up the entry in the hash table, remove it from the table, and push a slot onto the free slots list. All of these steps take an average of constant time, with the exception of looking up the resource type, which takes time logarithmic in the number of BPF types that can be tracked in this way.
In order to actually do this work, the patch changes the verifier to insert a call to a hook function that does the accounting and then forwards the call to the real function every time the program calls an acquire or release function. This works well for kfuncs (kernel functions exposed to BPF), and means that the kfuncs themselves don't need to be modified. Kfuncs are already annotated with KF_ACQUIRE and KF_RELEASE flags to indicate that they are involved in acquiring or releasing a resource. These functions, of which there are 39 pairs in the 6.13 kernel, all have similar signatures, taking only a single argument of their associated type. Deng's code uses that information to figure out the association between kfuncs and types.
That approach doesn't work for BPF helpers (functions exposed through an older and more brittle mechanism), however. Helpers don't have the same kinds of annotations. A full solution to making BPF programs killable will need to hard-code information on which BPF helpers are associated with different types.
Deng later suggested using the same hook mechanism to add run-time tracing for BPF programs. The verifier could, in a special debug mode, add a hook for every kfunc, not only the ones associated with acquiring or releasing resources. That could potentially extend something like the kernel's ftrace mechanism to work with BPF programs.
The future of BPF
Deng's patch set has not seen any discussion yet — but the underlying idea of adding more run-time checks to BPF is not new. BPF's reliance on the verifier in order to run code without most run-time checking is one of the things that makes it unusual among language runtimes. Other compiled languages can use sophisticated analyses to remove redundant checks or indirections, but they usually approach the problem by starting with all of the necessary checks, and then selectively prune away the ones the compiler can prove are not needed. BPF, on the other hand, essentially makes anything the verifier can't understand the programmer's problem.
Does this unique feature of BPF need to be preserved at any cost, requiring kernel developers to eschew any approach that introduces run-time overhead? Or is the more common approach that other languages take the inevitable future of trying to push for BPF to become more capable? Deng's patch tries to walk a middle path — introducing a feature with run-time overhead with the expectation that it will be replaced when the verifier is improved to track how resources need to be released. But it's hard not to see the need for patches like this as indicative of the fact that developers want access to new capabilities that the verifier cannot provide.
The verifier is already complex, and only going to become more so. Whether that will drive BPF developers to move away from implementing everything in the verifier over time remains to be seen. Either way, the topic seems likely to provoke some amount of discussion in the coming months.
Posted Feb 25, 2025 19:16 UTC (Tue)
by roc (subscriber, #30627)
[Link]
> BPF, on the other hand, essentially makes anything the verifier can't understand the programmer's problem.
This is a nightmare. Other comparable systems (CIL, JVM, WASM) carefully define what bytecode programs are valid and pretty much never change them (maybe relaxing a bit here and there), but valid bytecode is rich enough for compilers to compile any valid source program so users don't have to care.
I hope the BPF community eventually figures this out, but they're accumulating a horrendous amount of baggage in the meantime.
Posted Feb 25, 2025 20:27 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
At what point are they going to admit that they literally reimplemented a WASM runtime piecemeal?
Posted Feb 26, 2025 5:24 UTC (Wed)
by tullmann (subscriber, #20149)
[Link] (1 responses)
I do think the idea of isolating untrusted code so you can cleanly and efficiently terminate it is ... what user mode was originally designed for. So there is definitely a lot of prior art nearby in the kernel ....
Posted Feb 26, 2025 20:27 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
But yeah, all other technologies have just one fatal flaw.
Posted Feb 25, 2025 20:58 UTC (Tue)
by pj (subscriber, #4506)
[Link]
Posted Feb 26, 2025 9:00 UTC (Wed)
by taladar (subscriber, #68407)
[Link] (1 responses)
Usually write locks are held to make some changes to a data structure that should not be seen by anyone else in isolation but as a completed whole. If we just kill the program in the middle of such a change that would just leave the data structure in an undefined state forever, a much worse problem than a BPF program taking longer to finish than anticipated.
Posted Feb 26, 2025 10:44 UTC (Wed)
by aviallon (subscriber, #157205)
[Link]
Just take the obvious approach already
Sigh. NIH continues.
Sigh. NIH continues.
Sigh. NIH continues.
restricted BPF runtimes?
Just freeing locks? Doesn't that corrupt data?
Just freeing locks? Doesn't that corrupt data?
Or is it that BPF programs can't modify problematic resources directly anyway?
