A proposal for shared memory in BPF programs
Alexei Starovoitov introduced a patch series for the Linux kernel on February 6 to add bpf_arena, a new type of shared memory between BPF programs and user space. Starovoitov expects arenas to be useful both for bidirectional communication between user space and BPF programs, and for use as an additional heap for BPF programs. This will likely be useful to BPF programs that implement complex data structures directly, instead of relying on the kernel to supply them. Starovoitov cited Google's ghOSt project as an example and inspiration for the work.
BPF programs already have several ways to communicate with user space, including ring buffers, hash maps, and array maps. However, each of these methods has drawbacks. Ring buffers can be used to send performance measurements or trace events to user-space processes, but not to receive data from user space. Hash maps can be used for this purpose, but accessing them from user space requires making a bpf() system call. Array maps can be mapped into a user-space process's address space using mmap(), but Starovoitov notes that their "disadvantage is that memory for the whole array is reserved at the start". Array maps (and the new arenas) are stored in non-pageable kernel memory, so unused pages have a noticeable resource-usage cost.
His patch series allows BPF programs to create arenas of up to 4GB in size. Unlike array maps, these arenas do not allocate pages up front. BPF programs can add pages to the arena using bpf_arena_alloc_pages(), and pages are automatically added when a user-space program triggers a page fault inside the arena.
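As an illustration, the BPF side of this might look roughly like the following sketch, modeled on the selftests accompanying the patch series; the map-definition fields and the bpf_arena_alloc_pages() signature are as posted in version 1 and could change before merging:

    /* Minimal sketch: define an arena map and ask the kernel to back
     * one page of it with memory. */
    struct {
        __uint(type, BPF_MAP_TYPE_ARENA);
        __uint(map_flags, BPF_F_MMAPABLE);
        __uint(max_entries, 1000);    /* maximum number of pages */
    } arena SEC(".maps");

    SEC("syscall")
    int grow_arena(void *ctx)
    {
        /* Returns a pointer into the arena, or NULL on failure. */
        void *page = bpf_arena_alloc_pages(&arena, NULL, 1,
                                           NUMA_NO_NODE, 0);

        return page ? 0 : 1;
    }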
Seamless pointers
The patch series handles pointers inside the arena in an unusual way, ensuring that structures inside the arena can contain pointers to other parts of the arena, and that this works seamlessly for both user-space programs and BPF programs. Neither kind of program needs to be aware that implicit conversions are happening, even though the two programs have entirely different pointer representations. BPF programs represent pointers into the arena as 32-bit pointers in a separate address space (which the verifier ensures are not used as normal pointers, or vice versa), but user-space programs see the pointers as normal pointers for their architecture. The user-space representation is the one that is actually stored in the arena memory. The kernel maps space for the arena such that the lower 32 bits of the user-space pointer always match the BPF pointer, to keep conversions between the two representations fast.
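In the C source of a BPF program, this separate address space appears as an LLVM address-space attribute on the pointer type. A sketch following the convention used in the series' selftests (the __arena macro is theirs; the attribute number is what the associated LLVM patch uses):

    /* Arena pointers live in LLVM address space 1; the verifier and
     * JIT treat them as 32-bit offsets into the arena. */
    #define __arena __attribute__((address_space(1)))

    struct node {
        struct node __arena *next;    /* stored in user-space form */
        long value;
    };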
For example, the series includes, as part of its test suite, a program that implements a hash table in BPF, using linked lists to hold items that fall into the same bucket. The hash table can be populated in the kernel and then consumed from user space, or vice versa, with both sides able to follow the pointers in the data structure.
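Because the stored representation is the user-space one, the user-space side of such a structure needs no special handling at all; once the arena is mapped in, ordinary pointer-chasing code works. A sketch (this struct layout is illustrative, not the selftest's actual hash table):

    /* User-space view of the same nodes: plain pointers, plain C. */
    struct node {
        struct node *next;
        long value;
    };

    long sum_bucket(struct node *head)
    {
        long sum = 0;

        for (struct node *n = head; n; n = n->next)
            sum += n->value;
        return sum;
    }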
The patch series introduces two functions, bpf_cast_kern() and bpf_cast_user(), to cast between the kernel representation of a pointer and the user-space representation. There is an associated patch to LLVM's BPF backend to insert these conversions automatically where appropriate, to ensure that the user-space version is the one stored in memory in the arena. The patch series does introduce a new flag (BPF_F_NO_USER_CONV) to let BPF programs turn off this behavior. Arenas that do not perform pointer conversion can still be mapped to user space, but user-space programs will not be able to follow pointers contained therein.
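Conceptually, the two casts are cheap because of the lower-32-bits invariant described above: converting to the BPF representation truncates, and converting back combines the arena's upper bits with the 32-bit value. A sketch of the arithmetic (not the kernel's actual implementation):

    #include <stdint.h>

    /* To the BPF side: keep only the low 32 bits of the stored pointer. */
    static inline uint32_t to_bpf_repr(uint64_t user_ptr)
    {
        return (uint32_t)user_ptr;
    }

    /* To the user-space side: combine the arena's upper 32 bits with
     * the 32-bit BPF pointer. */
    static inline uint64_t to_user_repr(uint32_t bpf_ptr, uint64_t user_base)
    {
        return (user_base & ~UINT64_C(0xffffffff)) | bpf_ptr;
    }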
Review concerns
Barret Rhoden pointed out a problem with one detail of the implementation of this conversion. The initial version of the patch series leaves a hole in the arena (depending on where the arena is mapped in user space), so that BPF won't produce an object with a pointer ending in 0x00000000. Such an object would have an all-zero representation in the BPF program when converted into a 32-bit pointer, which could be confused with a null pointer and cause problems. Rhoden noted that "we'll have more serious issues if anyone accidentally tries to use that zero page", pointing out that if the BPF program tries to access the missing page, it will trigger a page fault and then die. Starovoitov agreed, saying that he would remove the missing page in version 2 of the series and that the logic was "causing more harm than good".
With the hole in the arena removed, BPF programs will need to avoid placing an object at the zero address and then taking a pointer to it, which is easily accomplished by adding some padding to the start of the arena.
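For example, an arena-based allocator can simply start handing out memory a few bytes past the beginning of the first page, so that no object's address ever has all-zero low 32 bits. A minimal sketch, assuming a bump allocator over arena memory and the __arena annotation shown earlier (arena_bump_alloc() and its globals are hypothetical, not from the series):

    /* Hypothetical bump allocator; the 8 bytes of padding keep any
     * object away from the address that BPF would see as NULL. */
    static char __arena *next_free;

    static void arena_alloc_init(void __arena *first_page)
    {
        next_free = (char __arena *)first_page + 8;    /* skip padding */
    }

    static void __arena *arena_bump_alloc(unsigned long size)
    {
        char __arena *p = next_free;

        next_free += size;
        return p;
    }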
Ensuring that the kernel and user space agree on the lower 32 bits of arena pointers is useful because it keeps the code generated by the BPF just-in-time (JIT) compiler simpler and therefore faster. If user space could map the arena at any address, as was the case in the initial version of this patch series, the representation of the arena in the kernel would be somewhat more complex, and could require additional logic to handle wraparound of the arena addresses cleanly.
Rhoden and Starovoitov continued discussing this detail, and eventually concluded that there was no reason to support mapping arenas to truly arbitrary addresses. Rhoden remarked that "the restriction of aligning a 4GB mapping to 4GB boundary is pretty sane."
Lorenzo Stoakes objected to the way in which the patch series allocates pages, because it uses vmap_pages_range(), a function internal to the kernel's virtual-memory allocator, to map pages into the arena. Stoakes said: "I see a lot of checks in vmap() that aren't in vmap_pages_range() for instance. [Are we] good to expose that, not only for you but for any other core kernel users?"
Johannes Weiner responded to say that the "vmap API is generally public", and that the "new BPF code needs the functionality of vmap_pages_range() in order to incrementally map privately managed arrays of pages into its vmap area". He went on to note that the function used to be public, and was made private when other external users of the function disappeared. Christoph Hellwig expressed dissatisfaction in another branch of the conversation: "We need to keep vmalloc internals internal and not start poking holes into the abstractions after we've got them roughly into shape."
While reviewing the changes internal to the BPF code, Andrii Nakryiko raised concerns about how the new arenas calculate their size. Existing BPF maps keep track of the size of keys, the size of values, and the total number of entries that can fit in the map. This works well for hash maps and array maps, but is not a good fit for the new arenas. Starovoitov decided to represent the arenas as having a key size and a value size of 8 bytes "to be able to extend it in the future and allow map_lookup/update/delete to do something useful". Nakryiko asserted that they "should probably make bpf_map_mmap_sz() aware of specific map type and do different calculations based on that", going on to point out that arenas are unlikely to be operated on using the normal BPF interfaces for looking up map entries.
Donald Hunter questioned why arenas were being represented in the code as a new kind of map at all, asking whether this was "the only way you can reuse the kernel / userspace plumbing?" Starovoitov replied that many of the existing maps usable by BPF programs don't support some map operations; Bloom filters and ring buffers in particular (two existing map types similar in some ways to the new arenas) do not support lookup, update, or delete operations. He went on to say that arenas "might be one the last maps that we will add, since almost any algorithm can be implemented in the arena".
Starovoitov quickly incorporated this feedback and published version 2 of the patch series. He had not, however, addressed Hellwig's concerns about exposing the low-level details of the virtual-memory allocation code. Hellwig reiterated his position, saying: "The vmap area is not for general abuse by random callers". Starovoitov responded that Hellwig ought to suggest an alternative if exposing the vmap_pages_range() function is unacceptable. Linus Torvalds chimed in to say that it is "not up to maintainers to suggest alternatives"; "The onus of coming up with an acceptable solution is on the person who needs something new".
Discussion of this version of the patch series is ongoing, but other than Hellwig's objection to exposing low-level details of the virtual-memory allocation code, the remaining concerns are relatively minor or have already been addressed. Being able to seamlessly share memory between BPF programs and user-space code is an attractive proposition, so it seems likely that this work will eventually make it in, even if doing so requires finding a different way for the BPF arena to allocate pages on demand.
Index entries for this article:
Kernel: BPF/Memory management
Kernel: Releases/6.9
Posted Feb 21, 2024 22:04 UTC (Wed)
by ms-tg (subscriber, #89231)
I would be curious to see what might result from a deliberate and intentional effort to describe a WASM-in-kernel system.
My limited exposure to WASM, and even more limited exposure to BPF, don’t lead me to see a ton of overlap beyond the trivial. Perhaps someone hacking a demo “for fun” and sharing it might open up more collaborative exploration of that shared space?
Posted Feb 21, 2024 22:16 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
Basically, nothing will change much. The only significant difference is that eBPF's static verifier will be replaced by runtime-based metering, which will have to happen with eBPF anyway because its verifier can't guarantee realistic runtime bounds when external functions or iterators are used.
Posted Feb 26, 2024 7:13 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
And in practice, pretty much all serious WASM runtimes have a way to limit the runtime of the script. Either by doing precise instruction metering, or by using some kind of interruption checking. The kernel would just need to use that functionality.
Posted Feb 26, 2024 10:21 UTC (Mon)
by kleptog (subscriber, #1183)
I guess it depends on what you expect the verifier to do. If you're expecting it to verify it finishes in a specific timeframe then you are correct. If you are verifying it finishes at all, then it doesn't matter.
AIUI the kernel only cares that it eventually exits while leaving the system in a consistent state, and doesn't particularly care when.
Posted Feb 26, 2024 17:04 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
Not quite. Some eBPF programs can now take a lock, and result in measurable system delays even without any external functions used. All the current eBPF guarantees are easy to replicate for any reasonable WASM runtime.
In theory, eBPF can be used in a more complicated way, with static analysis proving that multiple locks are taken and released in the correct order. But I doubt that this type of analysis can be done efficiently in the kernel.
Posted Feb 21, 2024 18:43 UTC (Wed)
by corbet (editor, #1)
The latter, of course; fixed now. In the future, please send typo reports to lwn@lwn.net rather than posting them here.