|
|
Log in / Subscribe / Register

Red-black trees for BPF programs

By Jonathan Corbet
February 27, 2023
Most of the kernel's code is written in C and intended to be run directly on the underlying hardware. That situation is changing in a few ways, though; one of those is the ability to write kernel code for the BPF virtual machine. The 6.3 kernel release will include a new API making the red-black tree data structure available to BPF programs. Beyond being an interesting feature in its own right, this new API shows how BPF is bringing a different approach to kernel programming — and to the C language in general.

The kernel has long made extensive use of red-black trees (rbtrees), which are a form of binary tree; this data structure offers fast lookups and the ability to perform insertions and deletions in bounded time. Red-black trees are found in I/O schedulers, graphics drivers, filesystems, the BPF verifier, CPU-scheduler run queues, network protocols, and beyond. One place they have not been found, though, is in programs written to run in the BPF virtual machine. As the complexity of BPF programs grows, though, so does the demand for advanced data structures. The BPF version of the red-black tree, added by Dave Marchevsky, is meant to address this need.

Within the kernel, data intended to be stored in an rbtree must be stored in a structure that embeds a struct rb_node. The BPF API looks similar in this respect; as described in the cover letter to the patch set, the first step is to define a structure to hold the data of interest along with a bpf_rb_node structure:

    struct node_data {
    	long key;
    	long data;
    	struct bpf_rb_node node;
    };

Kernel code will then declare a variable of type struct rb_root to hold the root of the tree. The interface for BPF programs looks a bit different; they must declare the root and associate it with the type it contains using a magic macro:

    struct bpf_rb_root tree_root __contains(node_data, node);

The program must also declare a spinlock and, by storing it into an array map alongside the tree_root variable, associate that lock with the red-black tree.

Thereafter, there are three functions (as of 6.3) that can be used to work with the rbtree:

    void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
    		   	bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b));
    struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
			struct bpf_rb_node *node);
    struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root);

Adding a node is done with bpf_rbtree_add(); the less() function is used to compare nodes so that the new node can be properly located within the tree. The first node in a tree can be had with bpf_rbtree_first(), and nodes can be removed with bpf_rbtree_remove(). It is a fairly bare API for now; more functions can be expected to appear in the future.

There are a number of interesting aspects to this API, starting with the fact that it exists at all. For a long time, there was only one way to add a complex data structure to BPF: as a BPF map. The BPF virtual machine simply did not have the support needed allow programs to directly manipulate data structures of any complexity. That has changed, especially with the addition of kfuncs (which allow BPF programs to directly call functions in the kernel), and better management of pointers. These features, along with the BPF verifier, create an environment that differs considerably from ordinary kernel development.

For example, as noted above, the program must associate a spinlock with the red-black tree. Once that is done, the verifier will ensure that no access to the tree — or to the data contained within it — happens unless the spinlock is held at the time. Acquiring a spinlock to access a data structure protected by that lock is mandatory in kernel code; severely unpleasant things will happen if the rule isn't followed. But the language itself cannot enforce that rule, so locking bugs are a fairly routine occurrence. As long as the verifier is doing its job properly, similar bugs cannot occur in BPF programs, at least as far as access to an rbtree is concerned.

The verifier also enforces rules regarding the ownership of pointers. For example, the first step in adding a node to an rbtree will be allocating that node, which is typically done with a call to bpf_obj_new(). On return from that call, the program will own the resulting pointer, and the verifier must be able to convince itself that the program will suitably dispose of that pointer before it exits. One way to do so is to free the memory again, naturally, but that is not hugely interesting. Another is to use bpf_rbtree_add() to add the new node to an rbtree, which will then take responsibility for it. If the program removes a node from the tree, it must, once again, take responsibility for disposing of it or the verifier will not let it run.

In other words, the verifier is implementing a sort of ownership model to ensure that memory leaks do not happen. It can also ensure that, once an object has been freed, the BPF program will no longer attempt to access it. In the case of rbtrees, implementing that check took some work, since accessing the tree (with bpf_rbtree_first(), for example) can create "non-owning" references that must all be invalidated when a node is freed. In other words, the verifier is preventing the creation of dangling pointers via aliases to a freed data structure.

The end result of all this checking is a programming environment that seems a bit more Rust-like and less C-like; there are whole classes of bugs that are eliminated before a program is allowed to run. Making the verifier happy can be a notoriously difficult task for some programming patterns, but the end result should be a higher level of assurance that the program will not damage the kernel.

Meanwhile, as of this writing, there are no users of the new data structure in the kernel. As a general rule, kernel developers will resist the addition of an API without associated users, but BPF is different; the whole point of a BPF API is to be available to programs that are not part of the kernel tree. There is one known user out of the kernel tree in the form of the extensible scheduler class; it would not be surprising to see others turn up as well.

As BPF maintainer Alexei Starovoitov said at a conference in 2022, the intent behind the BPF work is to create a safer version of the C language for kernel programming. This work is being done outside of the normal limelight that follows programming-language development and, like much of the kernel, is not following any sort of long-term roadmap. If this effort succeeds, kernel programming in the coming years may be quite different from how it has been until now. The addition of the red-black tree for BPF programs, in other words, is a piece of an interesting and novel forest.

Index entries for this article
KernelBPF
KernelReleases/6.3


to post comments

Red-black trees for BPF programs

Posted Feb 27, 2023 16:17 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (2 responses)

> void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
> bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b));

Some…interesting things can probably be done with a non-pure `less` function or by using different functions for different `bpf_rbtree_add` calls. Does the verifier do anything along those lines?

Red-black trees for BPF programs

Posted Feb 28, 2023 1:15 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (1 responses)

Looking through the patches and comments, it looks like it's doing some kind of funky non-owned ref logic to protect against that, but I don't speak kernel-C well enough to understand exactly how it all works. In particular, I couldn't figure out where it prevents you from passing a different less (but I didn't look very hard, frankly).

Red-black trees for BPF programs

Posted Feb 28, 2023 2:21 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

The non-owned ref logic can help prevent mutating the nodes in the `less` function (I presume/hope that `const_cast` spell in the C way is verbotten as well?). But are there any BPF rules that forbid referencing some BPF map, mutable global variable, or volatile (to BPF) kernel data binding?

Red-black trees for BPF programs

Posted Feb 28, 2023 7:37 UTC (Tue) by smurf (subscriber, #17840) [Link] (4 responses)

All of this sounds like it'd be a whole lot more straightforward to write BPF programs in Rust.

Red-black trees for BPF programs

Posted Mar 2, 2023 11:35 UTC (Thu) by rwmj (subscriber, #5474) [Link] (3 responses)

How would that solve anything, unless the userspace rust compiler was a "trusted compiler"? The verifier must still be able to verify the program so the kernel trusts it. Trusted compilers can be done -- see eg. the Burroughs architecture of old -- but they require a whole other infrastructure to ensure that only programs compiled by the trusted compiler can be run.

Red-black trees for BPF programs

Posted Mar 2, 2023 13:35 UTC (Thu) by smurf (subscriber, #17840) [Link] (2 responses)

The problem isn't the backend. The problem is the frontend.

Rust already guarantees e.g. that the source doesn't have memory leaks, it knows about ownership, etc., and can emit appropriate errors if the source code violates these constraints. Adding that kind of thing to C requires nontrivial infrastructure that's not part of the normal toolchain and IMHO can't ever be as comprehensive as using a language where that's built in from the beginning.

The problem isn't correct programs (except for the halting problem of course). The problem is incorrect programs and how to discover bugs before the verifier complains about them.

Red-black trees for BPF programs

Posted Mar 2, 2023 13:56 UTC (Thu) by rwmj (subscriber, #5474) [Link]

You can just run the verifier if you want to check your program is correct. I'm having a hard time still seeing how this helps at all with the important bit: The kernel needs to verify the code it is loading or the verification must have been done already by a trusted chain if you want to go the trusted compiler approach.

Personally I think the whole thing would be much more easily solved by using a simple garbage collector. Easier for the developers too.

Red-black trees for BPF programs

Posted Mar 2, 2023 18:05 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

Hmm. What model do you propose for `bpf_exit()` such that Rust's existing rules won't let a lock dangle? AFAIK, extra-program behavior such as external locks are outside of Rust's safety model. For example, this will leave the file locked (assuming the kernel doesn't clean up such things on program exit):

```
let file_lock = lockf(fd, F_LOCK, 0)?;
std::panic("hopefully the kernel cleans up after the above…");
```

Why shouldn't the BPF runtime do the cleanup of any locks, open handles, etc. upon the program's exit just as the kernel does for regular processes. At that point…Rust doesn't seem to be buying much other than maybe a preferable syntax for any casts, a lack of integer promotion rules, and various other implicit things requiring more explicit spellings.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds