Improved load-time checking for BPF kfuncs
Every BPF program is assigned a specific type; the full list of types can be found in Documentation/bpf/libbpf/program_types.rst. So, for example, BPF_PROG_TYPE_FLOW_DISSECTOR is for programs that implement network flow dissectors, while BPF_PROG_TYPE_LSM is for programs run from the BPF Linux Security Module. The kernel will not allow a program to be attached to a BPF hook if the program is not of the correct type for that hook. This restriction prevents BPF programs from being invoked from contexts that they are not designed for.
The kernel's kfunc mechanism is a relatively recent addition that allows any function within the kernel to be made available for direct calling from a BPF program. Here, too, it is important that kfuncs are only called from the correct context. So, whenever a set of kfuncs is registered with the BPF subsystem (using a call to register_btf_kfunc_id_set()), the program type must be supplied; the verifier will ensure that only programs of the given type can call a kfunc from that set.
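As a rough illustration of that per-type gating, here is a minimal user-space sketch in C. It is not kernel code: the enum, the `kfunc_set` structure, and both functions are hypothetical stand-ins that only loosely mirror what register_btf_kfunc_id_set() and the verifier do, and kfuncs are identified by name here rather than by BTF ID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for a few kernel program types. */
enum prog_type {
	PROG_TYPE_FLOW_DISSECTOR,
	PROG_TYPE_LSM,
	PROG_TYPE_STRUCT_OPS,
	PROG_TYPE_MAX,
};

/* One registered set: the kfunc names callable by one program type. */
struct kfunc_set {
	const char *const *names;
	size_t count;
};

static struct kfunc_set registered[PROG_TYPE_MAX];

/* Loosely models registration: tie a set of kfuncs to a program type. */
void register_kfunc_set(enum prog_type type, const char *const *names,
			size_t count)
{
	assert(type < PROG_TYPE_MAX);
	registered[type].names = names;
	registered[type].count = count;
}

/* Loosely models the load-time check: is this kfunc in the set for this type? */
bool kfunc_call_allowed(enum prog_type type, const char *kfunc)
{
	const struct kfunc_set *set = &registered[type];

	for (size_t i = 0; i < set->count; i++)
		if (strcmp(set->names[i], kfunc) == 0)
			return true;
	return false;
}
```

The point to notice is that the only key into the table is the program type; nothing else about the caller is visible to the check.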
This machinery works but has come under increasing strain over the years. The number of program types has grown considerably, and that has led to a desire to restrain that growth (without, of course, slowing the incursion of BPF into the few parts of the kernel where it is not yet found). That has resulted in BPF program types becoming more generic; in particular, the "struct ops" mechanism allows a BPF program to provide a structure full of functions that the kernel can call, under the BPF_PROG_TYPE_STRUCT_OPS program type. There are quite a few programs out there of this type that run in all kinds of contexts.
Any verification mechanism that relies on just the program type will be unable to tell one struct-ops program from another. Beyond that, though, there are reasons to treat the different functions called within a single struct-ops program as having different contexts. The BPF subsystem is currently unable to make that distinction, and that has complicated life.
In this patch set, Deng pointed to the sched_ext subsystem, which allows CPU schedulers to be written in BPF, as an example of this problem. A sched_ext program is of type BPF_PROG_TYPE_STRUCT_OPS; when it loads, it provides a structure of type sched_ext_ops with about three-dozen different function pointers, each for a callback that handles one aspect of the scheduling problem. The runnable() callback, for example, is invoked when a task becomes runnable and must be placed into a run queue, while cpu_offline() is called if a CPU is being removed from the system and tasks must be moved off of it. Clearly, the context in which these callbacks are called will vary considerably from one to the next.
The sched_ext subsystem also provides a number of kfuncs that allow BPF programs to perform scheduling tasks, such as putting a task onto a specific CPU. It is only appropriate to call some of those kfuncs from specific sched_ext_ops functions, though; they only make sense during the appropriate parts of the scheduling flow. To avoid problems, the sched_ext subsystem must track which BPF function is being called at the moment and, when a kfunc is invoked, ensure (with a call to scx_kf_allowed()) that the calling context is correct. This is an extra run-time check that would be more nicely done at load time; evidently this check is expensive enough to impact the performance of sched_ext schedulers.
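The general shape of such a run-time check might look like the following user-space sketch. Everything here is an assumption made for illustration: the context bits, the global mask (the kernel would track this per task), and the function names are invented, with `ctx_allowed()` playing the role that the article ascribes to scx_kf_allowed().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context bits, one per class of callback. */
#define CTX_ENQUEUE	(1u << 0)
#define CTX_DISPATCH	(1u << 1)
#define CTX_HOTPLUG	(1u << 2)

/* Which class of callback is running now; a global keeps the sketch short. */
static uint32_t current_ctx_mask;

/* Stand-in for the run-time test: is any required context active? */
bool ctx_allowed(uint32_t required)
{
	return (current_ctx_mask & required) != 0;
}

/* A made-up kfunc that only makes sense while dispatching. */
int demo_kfunc_dispatch_only(void)
{
	if (!ctx_allowed(CTX_DISPATCH))
		return -1;	/* paid on every call, even in correct programs */
	return 0;
}

/* The subsystem records the context around each callback invocation. */
int call_with_ctx(uint32_t ctx, int (*callback)(void))
{
	uint32_t saved = current_ctx_mask;
	int ret;

	assert(callback != NULL);
	current_ctx_mask = ctx;
	ret = callback();
	current_ctx_mask = saved;
	return ret;
}
```

In this sketch the callback is the kfunc itself; in a real scheduler the callback would be a BPF function that, in turn, calls the kfunc. Either way, both the bookkeeping in the wrapper and the test in the kfunc are executed on every invocation.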
Deng's first solution to this problem was to add a "capabilities" mechanism to the BPF subsystem. A capability mask (a 32-bit integer value) was added to each kfunc; bits set in that mask would indicate the capabilities needed to be able to call the kfunc. Each kfunc could then be registered along with the requisite capabilities. The patch set also provided a new callback (bpf_capabilities_adjust()) that would allow a subsystem (such as sched_ext) to specify which capabilities are held by a BPF program that it might run. This callback is invoked separately by the verifier during the checking of each BPF_PROG_TYPE_STRUCT_OPS function, allowing each to be provided with separate capabilities that may (or may not) make a given kfunc available. The end result is that the verifier gained the ability to prevent inappropriate kfunc calls at load time, and the run-time overhead was eliminated.
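A simplified model of that scheme, in user-space C, could look like the sketch below. The structures and function names are hypothetical, and `caps_for_member()` is only an approximation of the role described for bpf_capabilities_adjust(); the real callback's signature is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits. */
#define CAP_DISPATCH	(1u << 0)
#define CAP_HOTPLUG	(1u << 1)

/* A kfunc registered along with the capabilities it requires. */
struct kfunc_desc {
	const char *name;
	uint32_t required_caps;
};

/* One struct-ops member and the capabilities granted while it runs. */
struct ops_member {
	const char *name;
	uint32_t caps;
};

/* Subsystem callback: which capabilities does this member's program hold? */
uint32_t caps_for_member(const struct ops_member *members, size_t n,
			 size_t idx)
{
	assert(idx < n);
	return members[idx].caps;
}

/* Load-time check: every capability the kfunc requires must be held. */
bool verifier_allows(uint32_t held, const struct kfunc_desc *kfunc)
{
	return (held & kfunc->required_caps) == kfunc->required_caps;
}
```

Because the decision is made once per struct-ops function while the program is being verified, nothing remains to be checked when the kfunc is actually called.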
This implementation raised some concerns, though. Tejun Heo pointed out that a 32-bit mask for capabilities would surely be exhausted at some point. He also wondered if it was necessary to declare capabilities globally at all; given that the callback was needed in any case, it could just accept context information and make purely local decisions at that time. Alexei Starovoitov took issue with the term "capabilities", which already has a well-defined meaning in the kernel, but he also thought that the concept was unnecessary. He suggested just implementing this functionality as a filter callback instead.
Filters are a similar mechanism that was added during the 6.5 cycle by Aditi Ghag. A filter is yet another callback that is associated with each kfunc. The verifier will call this filter() function, if it exists, to determine whether a given call should be allowed or not. This functionality does seem quite similar to what capabilities brought, but Deng had some reservations about using it; in particular, filter functions do not work with kernel subsystems implemented as loadable modules. Starovoitov, though, was unworried about this restriction; the current user for this feature is sched_ext, which cannot be built as a module. The immediate problem (the performance impact on sched_ext) should be solved first, he said; other concerns can be addressed later if the need arises.
So Deng solved this problem anew, specifically for sched_ext, using the filter functionality. The new series adds some context information to the (already large) bpf_prog_aux structure that is available to filter functions. This context consists of a pointer to the operations structure itself and the byte offset within the structure of the specific function being called. Filter functions can use that information to determine which struct-ops function is being called and make a decision about whether that call should be allowed. As can be seen from this patch, for example, this mechanism is arguably not the most elegant ever, but it does get the job done.
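The offset-based approach can be sketched in user-space C as follows. This is an assumption-laden illustration rather than the code from the patch: `demo_sched_ops` is a cut-down, invented operations structure, `demo_prog_ctx` stands in for the new context fields, and the filter takes a kfunc name where a kernel filter would be given a program and a kfunc ID.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Cut-down, hypothetical operations structure. */
struct demo_sched_ops {
	void (*runnable)(void);
	void (*dispatch)(void);
	void (*cpu_offline)(void);
};

/* Context made available to the filter for the program being verified. */
struct demo_prog_ctx {
	const struct demo_sched_ops *ops;	/* the operations structure */
	size_t member_offset;			/* byte offset of the member */
};

/* Filter: return 0 to allow the call, a negative error to reject it. */
int demo_filter(const struct demo_prog_ctx *ctx, const char *kfunc)
{
	assert(ctx != NULL);

	/* Only one made-up kfunc is restricted in this sketch. */
	if (strcmp(kfunc, "demo_dispatch_task") != 0)
		return 0;

	/* Identify the calling member by comparing byte offsets. */
	if (ctx->member_offset == offsetof(struct demo_sched_ops, dispatch))
		return 0;

	return -EACCES;
}
```

Comparing against offsetof() values is what lets the filter tell one member of the structure from another without any per-member registration; it is also why the result reads as a chain of offset comparisons rather than a declarative table.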
In any case, it is sufficient to add load-time checks for sched_ext and eliminate the need for the run-time checks, once again addressing the performance problem. There are some residual glitches; if a kfunc appears in more than one exported set (which happens reasonably often when kfuncs must be exported to more than one program type), the filter no longer knows which offset to use and can make incorrect decisions. That causes at least one sched_ext program to fail to load with the current patch set. There are ways to work around this problem, including adding a simple wrapper for the kfunc to distinguish between the calling contexts, but this seems like the kind of trap that can easily snare unwary developers.
This new series is fresh as of this writing and has not yet generated any discussion. It does appear to have addressed the concerns raised the first time around, though, and to solve the immediate problem. So, while "capabilities" will not be coming to BPF programs anytime soon, better load-time decisions on the validity of kfunc calls would appear to be on the horizon.