
Sleepable BPF programs

By Jonathan Corbet
July 7, 2020
When support for classic BPF was added to the kernel many years ago, there was no question of whether BPF programs could block in their execution. Their functionality was limited to examining a packet's contents and deciding whether the packet should be forwarded or not; there was nothing such a program could do to block. Since then, BPF has changed a lot, but the assumption that BPF programs cannot sleep has been built deeply into the BPF machinery. More recently, classic BPF has been pushed aside by the extended BPF dialect; the wider applicability of extended BPF is now forcing a rethink of some basic assumptions.

BPF programs can now do many things that were not possible for classic BPF programs, including calling helper functions in the kernel, accessing data structures ("maps") shared with the kernel or user space, and synchronizing with spinlocks. The core assumption that BPF programs are atomic has not changed, though. Once the kernel jumps into a BPF program, that program must complete without doing anything that might put the thread it is running in to sleep. BPF programs themselves have no way of invoking any sort of blocking action, and the helper functions exported to BPF programs by the kernel are required to be atomic.

As BPF gains functionality and grows toward some sort of sentient singularity moment, though, the inability to block is increasingly getting in the way. There has, thus, been interest in making BPF programs sleepable for some time now, and that interest has recently expressed itself as code in the form of this patch set from Alexei Starovoitov.

The patch adds a new flag, BPF_F_SLEEPABLE, that can be used when loading BPF programs into the kernel; it marks programs that may sleep during their execution. That, in turn, informs the BPF verifier about the nature of the program, and brings a number of new restrictions into effect. Most of these restrictions are the result of the simple fact that the BPF subsystem was never designed with sleepable programs in mind. Parts of that subsystem have been updated to handle sleeping programs correctly, but many other parts have not. That is likely to change over time but, until then, the functionality implemented by any part of the BPF subsystem that still expects atomicity is off-limits to sleepable programs.

For example, of the many types of BPF programs supported by the kernel, only two are allowed to block: those run from the Linux security module subsystem and tracing programs (BPF_PROG_TYPE_LSM and BPF_PROG_TYPE_TRACING). Even then, tracing programs can only sleep if they are attached to security hooks or are attached to functions that have been set up for error injection. Other types of programs are likely to be added in the future, but the coverage will never be universal. Many types of BPF programs are invoked from within contexts that, themselves, do not allow sleeping — deep within the network packet-processing code or attached to atomic functions, for example — so making those programs sleepable is just not going to happen.

Sleepable BPF programs are also limited in the types of maps they can access; anything but hash or array maps is out of bounds. There is a further restriction with hash maps: they must be preallocated so that elements need not be allocated while a sleepable program is running. Most of the internal housekeeping within maps is currently done using read-copy-update (RCU), a protection scheme that breaks down if a BPF program is blocked while holding a reference to a map entry. The hash and array maps have been extended to use RCU tasks, which can handle sleeping code, for some operations. Once again, it seems likely that support for the combination of BPF maps and sleepable programs will grow over time.
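The preallocation requirement can be seen in how such a map would be declared. Here is a minimal sketch in libbpf's BTF-style map-definition syntax; the map name and sizes are illustrative:

```c
/* Sketch of a hash map a sleepable program could use. Hash maps are
 * preallocated by default; the point is that BPF_F_NO_PREALLOC must
 * *not* be set, since a sleepable program cannot tolerate element
 * allocation happening while it holds a reference into the map. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u32);
	__type(value, __u64);
} event_counts SEC(".maps");
```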

Sleepable BPF programs, thus, run with a number of restrictions that do not apply to the atomic variety. With Starovoitov's patch set, there is exactly one thing they can do that's unavailable to atomic BPF programs; it takes the form of a new helper function:

    long bpf_copy_from_user(void *dest, u32 size, const void *user_ptr);

This helper is a wrapper around the kernel's copy_from_user() function, which copies data from user space into the kernel. User-space data may not be resident in RAM when a call like this is made, so callers must always be prepared to block while one or more pages are brought in from secondary storage. This has prevented BPF programs from reading user-space data directly; now sleepable BPF programs will be able to do so. One potential use for this ability would be to allow security-related BPF programs to follow user-space pointers and get a better idea of what user space is actually up to.
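As a rough sketch of how that might look, consider a hypothetical sleepable LSM program; it assumes libbpf's conventions, where a ".s" suffix in the section name causes the program to be loaded with BPF_F_SLEEPABLE. The hook chosen and the user-space address read are purely illustrative:

```c
#include "vmlinux.h"            /* kernel types, generated with bpftool */
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* The ".s" in the section name marks this program as sleepable, so
 * libbpf loads it with BPF_F_SLEEPABLE set. */
SEC("lsm.s/bprm_committed_creds")
int BPF_PROG(peek_at_user_stack, struct linux_binprm *bprm)
{
	char buf[64] = {};

	/* bprm->p points into the new program's user-space stack;
	 * faulting those pages in may block, which is only legal
	 * because this program is sleepable. */
	bpf_copy_from_user(buf, sizeof(buf), (void *)bprm->p);
	return 0;
}
```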

This patch set is in its fifth revision as of this writing and seems likely to find its way into the mainline during the 5.9 merge window. After that, the restrictions on what sleepable BPF programs can do are likely to start getting in the way of users, so it would not be surprising to see work loosening those restrictions showing up relatively quickly. For some use cases, at least, BPF insomnia should soon be a thing of the past.

Index entries for this article
Kernel: BPF



"User-space data may not be resident in RAM"

Posted Jul 8, 2020 6:34 UTC (Wed) by epa (subscriber, #39769) [Link] (19 responses)

I note that the most recent Fedora release no longer creates a swap partition or swap file by default. Instead there is swapping to zram. But perhaps we are nearing the day when even that won't be necessary and systems won't have swap any more. If you also got rid of demand paging for executables, and assuming for the moment that memory-mapped files need not be handled, how much kernel code could be simplified out once you know that all user-space data is always in physical RAM?

"User-space data may not be resident in RAM"

Posted Jul 8, 2020 12:32 UTC (Wed) by k3ninho (subscriber, #50375) [Link] (1 responses)

>...how much kernel code could be simplified out once you know that all user-space data is always in physical RAM?
You're still going to LRU files listed in the VFS cache, and maybe that would go away when you've got enough spare RAM to pin a mount point into the cache -- operating (eg root and home partitions) from RAM at the cost of being write heavy keeping disk in fsync().

K3n.

"User-space data may not be resident in RAM"

Posted Jul 8, 2020 12:54 UTC (Wed) by epa (subscriber, #39769) [Link]

Unless I've misunderstood, the VFS cache is not user-space data. I meant if you can assume that all of a user process's address space (apart from memory-mapped files) is in RAM and so there is no possibility of having to wait for it to be paged in.

"User-space data may not be resident in RAM"

Posted Jul 8, 2020 13:59 UTC (Wed) by corbet (editor, #1) [Link] (15 responses)

That would be so hugely wasteful of RAM that I have a hard time seeing it ever happening. Programs have a tendency to never access large parts of their address spaces. And that's before you get into other tricks, like copy-on-write or memory shared with devices, that can cause faults. You'll be waiting a long time for that particular simplification, I think.

"User-space data may not be resident in RAM"

Posted Jul 8, 2020 20:38 UTC (Wed) by epa (subscriber, #39769) [Link] (4 responses)

If you apply the rule of thumb that swap space should equal twice the physical RAM(*), then you only need triple the amount of RAM to eliminate swap altogether. That could happen in a year or two depending on how well Moore's Law applies to RAM production. Of course, software becomes more bloated at the same time, which is why we still have swap space, but I think the improvement in machine size is slowly outstripping the rate at which programmers can add bloat.

As far as I know, Android systems don't have swap space. I would expect most embedded Linux systems to do without it also. A spinning disk is too slow, and swapping to flash memory wears it out. Meanwhile supercomputers will normally run a specialized application designed to fit into physical RAM. It's the medium-sized systems that still use swap, as a kind of last resort before breaking out the OOM killer. I think this middle space is likely to shrink in the years to come.

I take your point that programs may allocate a huge address space and then use little of it. That's a slightly different issue from swap: no 64-bit system has a swap file big enough for the whole virtual address space that some process might make, and you can have huge (but mostly empty) address spaces with no swap at all. But it would invalidate the property that access to user memory can never block. If the page had been allocated but not touched, a page fault would still be needed on first use.

(*) I know this rule came from various old systems and Unix variants in simpler times. I use it as a rough example of a fairly generous allocation of swap space, not that I am advocating it on current systems. My point is that even if swap is much bigger than RAM, that's still less than an order of magnitude to overcome.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 8:14 UTC (Thu) by farnz (subscriber, #17727) [Link] (2 responses)

My Android devices do have swap - it's backed by a zram compressed RAM device, but it's swap nonetheless.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 11:03 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

Thanks for the correction. So Fedora is now catching up to Android in this regard.

I wonder whether for Android devices an old-fashioned task swapper might work better than demand paging, with an app that you've switched away from (and which does not run in the background) being swapped out to zram and swapped back in when you open it.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 11:29 UTC (Thu) by farnz (subscriber, #17727) [Link]

Android effectively has that, by requiring all apps to be capable of writing their own state to storage and exiting while backgrounded. However, swap helps because it allows more processes to sit idle in memory, ready for an on-demand "instant" switch instead of a slightly slower restore from storage.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 9:40 UTC (Thu) by gray_-_wolf (subscriber, #131074) [Link]

> eliminate swap altogether.

I like to be able to hibernate my machine though...

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 0:50 UTC (Thu) by Fowl (subscriber, #65667) [Link] (9 responses)

Maybe in a world without fork()...

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 6:28 UTC (Thu) by epa (subscriber, #39769) [Link] (8 responses)

Yes, the fork-exec idiom is a longstanding pain point for conservative memory management. If only posix_spawn() weren't so kitchen-sink-looking compared with the classical elegance of separate fork and exec primitives.

As for ordinary fork() where you won't immediately exec() afterwards, that might not spoil things, as it's only *writing* to user-space memory that can cause a copy-on-write fault, not reading it. And a system call to read bytes from a file into some memory may as well reserve the physical space in advance, rather than allocating it only once data comes back from the disk.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 6:35 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> As for ordinary fork() where you won't immediately exec() afterwards, that might not spoil things, as it's only *writing* to user-space memory that can cause a copy-on-write fault, not reading it.
Unless you're writing from other threads...

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 8:55 UTC (Thu) by epa (subscriber, #39769) [Link] (2 responses)

You've reached the limits of my knowledge. Could you explain how writing from other threads would cause a copy-on-write page fault to happen when a page is *read* rather than when it is written?

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 13:28 UTC (Thu) by Fowl (subscriber, #65667) [Link]

The fault wouldn't occur on a read, that's true, but even if the child process doesn't write, the parent is still running with all its threads, which will likely write eventually, and those can then unexpectedly fault.

So you'd have to not write in *both* the parent and the child until the child execs.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 17:54 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

The child and parent both become CoW, and fork() does not stop any other threads in the parent. These threads can very well keep on writing to the newly shared memory.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 13:13 UTC (Thu) by Fowl (subscriber, #65667) [Link] (2 responses)

Ha, that's an interesting idea - fork_nocow(). Exist in the parent's address space until exec().

Hey, clone(2) already exists! A trampoline to spill all your registers to memory and you're done. Except for having to be careful not to leak any memory, as the memory would leak in the parent too, it's probably a friendlier environment than using regular fork - all your threads are still there. Oh and locks, who knows what's going on with that.

Oh and how does thread-local storage work? Can you move it between threads? Is it even common? Does it depend on the thread ID? Is it possible to fake the TID? I don't know

Hmm and how do you know when you can free the stack you've just allocated for this new out-of-process-but-same-address-space thread you've just started? And I bet python and other runtime-y things are not going to be happy with two threads that think they're the same. Maybe block the parent until the child exec s? I'm going to wave my magic wand and proclaim that if you did this you could then share the same stack on both sides of this new fork_nocow.

Anyway, I had fun spending 15 minutes thinking about it.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 13:20 UTC (Thu) by Fowl (subscriber, #65667) [Link] (1 responses)

Oh I'm an idiot. fork has to return - of course you can't share a stack. Copying/COWing just the stack would mess up all those RAII and scope based resource cleanup things. More likely to work with a callback, but then you've thrown away the almost-compatibility with existing code that was the whole point.

Still, fun.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 17:06 UTC (Thu) by JGR (subscriber, #93631) [Link]

Your fork_nocow() is probably closest to vfork(2), which is approximately the same as forking but without a new stack or any CoW. However vfork is only really useful for the child to immediately call exec().

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 21:39 UTC (Thu) by ibukanov (subscriber, #3942) [Link]

posix_spawn is a kitchen-sink because there is no way to create a suspended process, apply all necessary settings/syscalls to it and start it. Hence the need to communicate a lot of possible settings through a single system call.

"User-space data may not be resident in RAM"

Posted Jul 8, 2020 17:24 UTC (Wed) by zlynx (guest, #2285) [Link]

I think that might happen in the distant future when all storage is NVRAM. If your entire storage system *IS* RAM already, there would be no point in paging. The CPU would be handling it in hardware via level 3 or level 4 cache.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds