Sleepable BPF programs
BPF programs can now do many things that were not possible for classic BPF programs, including calling helper functions in the kernel, accessing data structures ("maps") shared with the kernel or user space, and synchronizing with spinlocks. The core assumption that BPF programs are atomic has not changed, though. Once the kernel jumps into a BPF program, that program must complete without doing anything that might put the thread it is running in to sleep. BPF programs themselves have no way of invoking any sort of blocking action, and the helper functions exported to BPF programs by the kernel are required to be atomic.
As BPF gains functionality and grows toward some sort of sentient singularity moment, though, the inability to block is increasingly getting in the way. There has, thus, been interest in making BPF programs sleepable for some time now, and that interest has recently expressed itself as code in the form of this patch set from Alexei Starovoitov.
The patch set adds a new flag, BPF_F_SLEEPABLE, that can be used when loading BPF programs into the kernel; it marks programs that may sleep during their execution. That, in turn, informs the BPF verifier about the nature of the program, and brings a number of new restrictions into effect. Most of these restrictions are the result of the simple fact that the BPF subsystem was never designed with sleepable programs in mind. Parts of that subsystem have been updated to handle sleeping programs correctly, but many other parts have not. That is likely to change over time but, until then, the functionality implemented by any part of the BPF subsystem that still expects atomicity is off-limits to sleepable programs.
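As a rough illustration (not taken from the patch set itself), loading a program with the new flag through the raw bpf() system call might look something like this; the insns array is a stand-in for a real, verifier-acceptable instruction sequence, and a real LSM-program load would need additional setup (BTF attach information) that is elided here:

```c
/* Sketch: loading a sleepable BPF program via the raw bpf() syscall.
 * Assumes UAPI headers new enough to define BPF_F_SLEEPABLE. */
#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

static int load_sleepable(const struct bpf_insn *insns, __u32 insn_cnt)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_LSM;  /* one of the two allowed types */
    attr.insns = (__u64)(unsigned long) insns;
    attr.insn_cnt = insn_cnt;
    attr.license = (__u64)(unsigned long) "GPL";
    attr.prog_flags = BPF_F_SLEEPABLE;   /* mark the program as sleepable */

    /* Returns a program file descriptor, or -1 on error. */
    return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

In practice most loaders go through libbpf rather than the raw system call; libbpf sets this flag itself for programs placed in sections whose names carry a ".s" suffix, such as "lsm.s/".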
For example, of the many types of BPF programs supported by the kernel, only two are allowed to block: those run from the Linux security module subsystem and tracing programs (BPF_PROG_TYPE_LSM and BPF_PROG_TYPE_TRACING). Even then, tracing programs can only sleep if they are attached to security hooks or to functions that have been set up for error injection. Other types of programs are likely to be added in the future, but the coverage will never be universal. Many types of BPF programs are invoked from within contexts that, themselves, do not allow sleeping — deep within the network packet-processing code or attached to atomic functions, for example — so making those programs sleepable is just not going to happen.
Sleepable BPF programs are also limited in the types of maps they can access; anything but hash or array maps is out of bounds. There is a further restriction with hash maps: they must be preallocated so that elements need not be allocated while a sleepable program is running. Most of the internal housekeeping within maps is currently done using read-copy-update (RCU), a protection scheme that breaks down if a BPF program is blocked while holding a reference to a map entry. The hash and array maps have been extended to use RCU tasks, which can handle sleeping code, for some operations. Once again, it seems likely that support for the combination of BPF maps and sleepable programs will grow over time.
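In libbpf's BTF-style map-definition syntax, a hash map usable from a sleepable program would simply be declared without BPF_F_NO_PREALLOC, since preallocation is the default for hash maps. A minimal sketch (the map name and sizes are illustrative):

```c
/* Sketch: a preallocated hash map of the kind a sleepable program may
 * access. Setting BPF_F_NO_PREALLOC in map_flags would make this map
 * off-limits to sleepable programs. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);
    __type(value, __u64);
    /* no __uint(map_flags, BPF_F_NO_PREALLOC) here */
} counts SEC(".maps");
```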
Sleepable BPF programs, thus, run with a number of restrictions that do not apply to the atomic variety. With Starovoitov's patch set, there is exactly one thing they can do that's unavailable to atomic BPF programs; it takes the form of a new helper function:
long bpf_copy_from_user(void *dest, u32 size, const void *user_ptr);
This helper is a wrapper around the kernel's copy_from_user() function, which copies data from user space into the kernel. User-space data may not be resident in RAM when a call like this is made, so callers must always be prepared to block while one or more pages are brought in from secondary storage. This has prevented BPF programs from reading user-space data directly; now sleepable BPF programs will be able to do so. One potential use for this ability would be to allow security-related BPF programs to follow user-space pointers and get a better idea of what user space is actually up to.
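As a hypothetical example (the program name, the global variable, and the choice of hook are illustrative, not from the patch set), a sleepable LSM program could copy a buffer from user space for inspection:

```c
/* Sketch of a sleepable LSM program using bpf_copy_from_user(). */
#include <linux/types.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

struct linux_binprm;    /* normally provided by vmlinux.h */

/* A user-space address, assumed to be filled in by the loader before
 * the program is attached (a common pattern, but hypothetical here). */
const volatile unsigned long user_buf_addr = 0;

/* The ".s" suffix in "lsm.s/" is how libbpf marks a sleepable program. */
SEC("lsm.s/bprm_committed_creds")
int BPF_PROG(watch_exec, struct linux_binprm *bprm)
{
    char buf[64] = {};

    /* This copy may fault pages in from storage; that is legal only
     * because the program was loaded with BPF_F_SLEEPABLE. */
    if (bpf_copy_from_user(buf, sizeof(buf),
                           (const void *) user_buf_addr) < 0)
        return 0;
    /* ... inspect the copied data ... */
    return 0;
}
```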
This patch set is in its fifth revision as of this writing and seems likely
to find its way into the mainline during the 5.9 merge window. After that,
the restrictions on what sleepable BPF programs can do are likely to start
getting in the way of users, so it would not be surprising to see work
loosening those restrictions showing up relatively quickly. For some use
cases, at least, BPF insomnia should soon be a thing of the past.
Posted Jul 8, 2020 20:38 UTC (Wed) by epa (subscriber, #39769)
As far as I know, Android systems don't have swap space. I would expect most embedded Linux systems to do without it also. A spinning disk is too slow, and swapping to flash memory wears it out. Meanwhile supercomputers will normally run a specialized application designed to fit into physical RAM. It's the medium-sized systems that still use swap, as a kind of last resort before breaking out the OOM killer. I think this middle space is likely to shrink in the years to come.
I take your point that programs may allocate a huge address space and then use little of it. That's a slightly different issue from swap: no 64-bit system has a swap file big enough for the whole virtual address space that some process might make, and you can have huge (but mostly empty) address spaces with no swap at all. But it would invalidate the property that access to user memory can never block. If the page had been allocated but not touched, a page fault would still be needed on first use.
(*) I know this rule came from various old systems and Unix variants in simpler times. I use it as a rough example of a fairly generous allocation of swap space, not that I am advocating it on current systems. My point is that even if swap is much bigger than RAM, that's still less than an order of magnitude to overcome.
Posted Jul 9, 2020 8:14 UTC (Thu) by farnz (subscriber, #17727)
My Android devices do have swap - it's backed by a zram compressed RAM device, but it's swap nonetheless.
Posted Jul 9, 2020 11:03 UTC (Thu) by epa (subscriber, #39769)
I wonder whether for Android devices an old-fashioned task swapper might work better than demand paging, with an app that you've switched away from (and which does not run in the background) being swapped out to zram and swapped back in when you open it.
Posted Jul 9, 2020 11:29 UTC (Thu) by farnz (subscriber, #17727)
Android effectively has that, by requiring all apps to be capable of writing their own state to storage and exiting while backgrounded. However, swap helps because it allows more processes to sit idle in memory, ready for an on-demand "instant" switch instead of a slightly slower restore from storage.
Posted Jul 9, 2020 9:40 UTC (Thu) by gray_-_wolf (subscriber, #131074)
I like to be able to hibernate my machine though...
Posted Jul 9, 2020 6:28 UTC (Thu) by epa (subscriber, #39769)
As for ordinary fork() where you won't immediately exec() afterwards, that might not spoil things, as it's only *writing* to user-space memory that can cause a copy-on-write fault, not reading it. And a system call to read bytes from a file into some memory may as well reserve the physical space in advance, rather than allocating it only once data comes back from the disk.
Posted Jul 9, 2020 13:28 UTC (Thu) by Fowl (subscriber, #65667)
So you'd have to not write in *both* the parent and the child until the child execs.
Posted Jul 9, 2020 13:13 UTC (Thu) by Fowl (subscriber, #65667)
Oh, and how does thread-local storage work? Can you move it between threads? Is it even common? Does it depend on the thread ID? Is it possible to fake the TID? I don't know.

Hmm, and how do you know when you can free the stack you've just allocated for this new out-of-process-but-same-address-space thread you've just started? And I bet Python and other runtime-y things are not going to be happy with two threads that think they're the same. Maybe block the parent until the child... Anyway, I had fun spending 15 minutes thinking about it.
Posted Jul 8, 2020 17:24 UTC (Wed) by zlynx (guest, #2285)

"User-space data may not be resident in RAM"
You're still going to LRU files listed in the VFS cache, and maybe that would go away when you've got enough spare RAM to pin a mount point into the cache -- operating (eg root and home partitions) from RAM at the cost of being write heavy keeping disk in fsync().
"User-space data may not be resident in RAM"
That would be so hugely wasteful of RAM that I have a hard time seeing it ever happening. Programs have a tendency to never access large parts of their address spaces. And that's before you get into other tricks, like copy-on-write or memory shared with devices, that can cause faults. You'll be waiting a long time waiting for that particular simplification, I think.
"User-space data may not be resident in RAM"
"User-space data may not be resident in RAM"
"User-space data may not be resident in RAM"
"User-space data may not be resident in RAM"
"User-space data may not be resident in RAM"
"User-space data may not be resident in RAM"
"User-space data may not be resident in RAM"
"User-space data may not be resident in RAM"
"User-space data may not be resident in RAM"
Unless you're writing from other threads...
"User-space data may not be resident in RAM"
"User-space data may not be resident in RAM"
"User-space data may not be resident in RAM"
Ha, that's an interesting idea - fork_nocow(). Exist in the parent's address space until exec().

clone(2) already exists! A trampoline to spill all your registers to memory and you're done. Except for having to be careful not to leak any memory, as the memory would leak in the parent too, it's probably a friendlier environment than using regular fork - all your threads are still there. Oh and locks, who knows what's going on with that.

...execs? I'm going to wave my magic wand and proclaim that if you did this you could then share the same stack on both sides of this new fork_nocow.

Oh I'm an idiot. fork has to return - of course you can't share a stack. Copying/COWing just the stack would mess up all those RAII and scope-based resource cleanup things. More likely to work with a callback, but then you've thrown away the almost-compatibility with existing code that was the whole point.
"User-space data may not be resident in RAM"
"User-space data may not be resident in RAM"
"User-space data may not be resident in RAM"