A proposal for shared memory in BPF programs
Alexei Starovoitov introduced a patch series for the Linux kernel on February 6 to add bpf_arena, a new type of shared memory between BPF programs and user space. Starovoitov expects arenas to be useful both for bidirectional communication between user space and BPF programs, and for use as an additional heap for BPF programs. This will likely be useful to BPF programs that implement complex data structures directly, instead of relying on the kernel to supply them. Starovoitov cited Google's ghOSt project as an example and inspiration for the work.
BPF programs already have several ways to communicate with user space, including ring buffers, hash maps, and array maps. However, each of these methods has drawbacks. Ring buffers can be used to send performance measurements or trace events to user-space processes, but not to receive data from user space. Hash maps can be used for this purpose, but accessing them from user space requires making a bpf() system call. Array maps can be mapped into a user-space process's address space using mmap(), but Starovoitov notes that their "disadvantage is that memory for the whole array is reserved at the start". Array maps (and the new arenas) are stored in non-pageable kernel memory, so unused pages have a noticeable resource-usage cost.
His patch series allows BPF programs to create arenas of up to 4GB in size. Unlike array maps, these arenas do not allocate pages up front. BPF programs can add pages to the arena using bpf_arena_alloc_pages(), and pages are automatically added when a user-space program triggers a page fault inside the arena.
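As an illustration, the BPF side of this might look roughly like the following sketch, modeled on the selftests accompanying the patch series; the map-definition fields and the bpf_arena_alloc_pages() signature are as posted in version 1 and could change before merging:

    /* Minimal sketch: define an arena map and ask the kernel to back
     * one page of it with memory. */
    struct {
        __uint(type, BPF_MAP_TYPE_ARENA);
        __uint(map_flags, BPF_F_MMAPABLE);
        __uint(max_entries, 1000);    /* maximum number of pages */
    } arena SEC(".maps");

    SEC("syscall")
    int grow_arena(void *ctx)
    {
        /* Returns a pointer into the arena, or NULL on failure. */
        void *page = bpf_arena_alloc_pages(&arena, NULL, 1,
                                           NUMA_NO_NODE, 0);

        return page ? 0 : 1;
    }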
Seamless pointers
The patch series handles pointers inside the arena in an unusual way, ensuring that structures inside the arena can contain pointers to other parts of the arena, and that this works seamlessly for both user-space programs and BPF programs. Neither kind of program needs to be aware that implicit conversions are happening, even though the two programs have entirely different pointer representations. BPF programs represent pointers into the arena as 32-bit pointers in a separate address space (which the verifier ensures are not used as normal pointers, or vice versa), but user-space programs see the pointers as normal pointers for their architecture. The user-space representation is the one that is actually stored in the arena memory. The kernel maps space for the arena such that the lower 32 bits of the user-space pointer always match the BPF pointer, to keep conversions between the two representations fast.
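In the C source of a BPF program, this separate address space appears as an LLVM address-space attribute on the pointer type. A sketch following the convention used in the series' selftests (the __arena macro is theirs; the attribute number is what the associated LLVM patch uses):

    /* Arena pointers live in LLVM address space 1; the verifier and
     * JIT treat them as 32-bit offsets into the arena. */
    #define __arena __attribute__((address_space(1)))

    struct node {
        struct node __arena *next;    /* stored in user-space form */
        long value;
    };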
For example, the series includes, as part of its test suite, a program that implements a hash table in BPF, using linked lists to hold items that fall into the same bucket. The hash table can be populated in the kernel and then consumed from user space, or vice versa, with both sides able to follow the pointers in the data structure.
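Because the stored representation is the user-space one, the user-space side of such a structure needs no special handling at all; once the arena is mapped in, ordinary pointer-chasing code works. A sketch (this struct layout is illustrative, not the selftest's actual hash table):

    /* User-space view of the same nodes: plain pointers, plain C. */
    struct node {
        struct node *next;
        long value;
    };

    long sum_bucket(struct node *head)
    {
        long sum = 0;

        for (struct node *n = head; n; n = n->next)
            sum += n->value;
        return sum;
    }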
The patch series introduces two functions, bpf_cast_kern() and bpf_cast_user(), to cast between the kernel representation of a pointer and the user-space representation. There is an associated patch to LLVM's BPF backend to insert these conversions automatically where appropriate, to ensure that the user-space version is the one stored in memory in the arena. The patch series does introduce a new flag (BPF_F_NO_USER_CONV) to let BPF programs turn off this behavior. Arenas that do not perform pointer conversion can still be mapped to user space, but user-space programs will not be able to follow pointers contained therein.
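Conceptually, the two casts are cheap because of the lower-32-bits invariant described above: converting to the BPF representation truncates, and converting back combines the arena's upper bits with the 32-bit value. A sketch of the arithmetic (not the kernel's actual implementation):

    #include <stdint.h>

    /* To the BPF side: keep only the low 32 bits of the stored pointer. */
    static inline uint32_t to_bpf_repr(uint64_t user_ptr)
    {
        return (uint32_t)user_ptr;
    }

    /* To the user-space side: combine the arena's upper 32 bits with
     * the 32-bit BPF pointer. */
    static inline uint64_t to_user_repr(uint32_t bpf_ptr, uint64_t user_base)
    {
        return (user_base & ~UINT64_C(0xffffffff)) | bpf_ptr;
    }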
Review concerns
Barret Rhoden pointed out a problem with one detail of the implementation of this conversion. The initial version of the patch series leaves a hole in the arena (depending on where the arena is mapped in user space), so that BPF won't produce an object with a pointer ending in 0x00000000. Such an object would have an all-zero representation in the BPF program when converted into a 32-bit pointer, which could be confused with a null pointer and cause problems. Rhoden noted that "we'll have more serious issues if anyone accidentally tries to use that zero page", pointing out that if the BPF program tries to access the missing page, it will trigger a page fault and then die. Starovoitov agreed, saying that he would remove the missing page in version 2 of the series and that the logic was "causing more harm than good".
With the hole in the arena removed, BPF programs will need to avoid placing an object at the zero address and then taking a pointer to it, which is easily accomplished by adding some padding to the start of the arena.
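For example, an arena-based allocator can simply start handing out memory a few bytes past the beginning of the first page, so that no object's address ever has all-zero low 32 bits. A minimal sketch, assuming a bump allocator over arena memory and the __arena annotation shown earlier (arena_bump_alloc() and its globals are hypothetical, not from the series):

    /* Hypothetical bump allocator; the 8 bytes of padding keep any
     * object away from the address that BPF would see as NULL. */
    static char __arena *next_free;

    static void arena_alloc_init(void __arena *first_page)
    {
        next_free = (char __arena *)first_page + 8;    /* skip padding */
    }

    static void __arena *arena_bump_alloc(unsigned long size)
    {
        char __arena *p = next_free;

        next_free += size;
        return p;
    }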
Ensuring that the kernel and user space agree on the lower 32 bits of arena pointers is useful because it keeps the code generated by the BPF just-in-time (JIT) compiler simpler and therefore faster. If user space could map the arena at any address, as was the case in the initial version of this patch series, the representation of the arena in the kernel would be somewhat more complex, and could require additional logic to handle wraparound of the arena addresses cleanly.
Rhoden and Starovoitov continued discussing this detail, and eventually concluded that there was no reason to support mapping arenas to truly arbitrary addresses. Rhoden remarked that "the restriction of aligning a 4GB mapping to 4GB boundary is pretty sane."
Lorenzo Stoakes objected to the way in which the patch series allocates pages, because it uses vmap_pages_range(), a function internal to the kernel's virtual-memory allocator, to map pages into the arena. Stoakes said: "I see a lot of checks in vmap() that aren't in vmap_pages_range() for instance. [Are we] good to expose that, not only for you but for any other core kernel users?"
Johannes Weiner responded to say that the "vmap API is generally public", and that the "new BPF code needs the functionality of vmap_pages_range() in order to incrementally map privately managed arrays of pages into its vmap area". He went on to note that the function used to be public, and was made private when other external users of the function disappeared. Christoph Hellwig expressed dissatisfaction in another branch of the conversation: "We need to keep vmalloc internals internal and not start poking holes into the abstractions after we've got them roughly into shape."
While reviewing the changes internal to the BPF code, Andrii Nakryiko raised concerns about how the new arenas calculate their size. Existing BPF maps keep track of the size of keys, the size of values, and the total number of entries that can fit in the map. This works well for hash maps and array maps, but is not a good fit for the new arenas. Starovoitov decided to represent the arenas as having a key size and a value size of 8 bytes "to be able to extend it in the future and allow map_lookup/update/delete to do something useful". Nakryiko asserted that they "should probably make bpf_map_mmap_sz() aware of specific map type and do different calculations based on that", going on to point out that arenas are unlikely to be operated on using the normal BPF interfaces for looking up map entries.
Donald Hunter questioned why arenas were being represented in the code as a new kind of map at all, asking whether this was "the only way you can reuse the kernel / userspace plumbing?" Starovoitov replied that many of the existing maps usable by BPF programs don't support some map operations; Bloom filters and ring buffers in particular (two existing map types similar in some ways to the new arenas) do not support lookup, update, or delete operations. He went on to say that arenas "might be one the last maps that we will add, since almost any algorithm can be implemented in the arena".
Starovoitov quickly incorporated this feedback and published version 2 of the patch series. He had not, however, addressed Hellwig's concerns about exposing the low-level details of the virtual-memory allocation code. Hellwig reiterated his position, saying: "The vmap area is not for general abuse by random callers". Starovoitov responded that Hellwig ought to suggest an alternative if exposing the vmap_pages_range() function is unacceptable. Linus Torvalds chimed in to say that it is "not up to maintainers to suggest alternatives"; "The onus of coming up with an acceptable solution is on the person who needs something new".
Discussion of this version of the patch series is ongoing, but other than Hellwig's objection to exposing low-level details of the virtual-memory allocation code, the remaining concerns are relatively minor or have already been addressed. Being able to seamlessly share memory between BPF programs and user-space code is an attractive proposition, so it seems likely that this work will eventually make it in, even if doing so requires finding a different way for the BPF arena to allocate pages on demand.
Index entries for this article:
Kernel: BPF/Memory management
Kernel: Releases/6.9
Posted Feb 21, 2024 22:04 UTC (Wed)
by ms-tg (subscriber, #89231)
I would be curious to see what might result from a deliberate and intentional effort to describe a WASM-in-kernel system.
My limited exposure to WASM, and even more limited exposure to BPF, don’t lead me to see a ton of overlap beyond the trivial. Perhaps someone hacking a demo “for fun” and sharing it might open up more collaborative exploration of that shared space?
Posted Feb 21, 2024 22:16 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
Basically, nothing will change much. The only significant difference is that eBPF's static verifier will be replaced by runtime-based metering, which will have to happen with eBPF anyway because its verifier can't guarantee realistic runtime bounds when external functions or iterators are used.
Posted Feb 26, 2024 7:13 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
And in practice, pretty much all serious WASM runtimes have a way to limit the runtime of the script. Either by doing precise instruction metering, or by using some kind of interruption checking. The kernel would just need to use that functionality.
Posted Feb 26, 2024 10:21 UTC (Mon)
by kleptog (subscriber, #1183)
I guess it depends on what you expect the verifier to do. If you're expecting it to verify it finishes in a specific timeframe then you are correct. If you are verifying it finishes at all, then it doesn't matter.
AIUI the kernel only cares that it eventually exits while leaving the system in a consistent state, and doesn't particularly care when.
Posted Feb 26, 2024 17:04 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
Not quite. Some eBPF programs can now take a lock, and result in measurable system delays even without any external functions used. All the current eBPF guarantees are easy to replicate for any reasonable WASM runtime.
In theory, eBPF can be used in a more complicated way, with static analysis proving that multiple locks are taken and released in the correct order. But I doubt that this type of analysis can be done efficiently in the kernel.
Posted Feb 21, 2024 18:43 UTC (Wed)
by corbet (editor, #1)
The latter, of course; fixed now. In the future, please send typo reports to lwn@lwn.net rather than posting them here.