Yet another try at the BPF program allocator

By Jonathan Corbet
November 28, 2022

The BPF subsystem, which allows code to be loaded into the kernel from user space and safely executed in the kernel context, is bound to create a number of challenges for the kernel as a whole. One might not think that allocating memory for BPF programs would be high on the list of problems, but life (and memory management) can be surprising. The attempts to do a better job of providing space for compiled BPF code have, to date, only been partially successful; now Song Liu is back with a new approach to finish the job.

Small, transient, and numerous

The problem with BPF programs is that they tend to be small, are often transient, and can be numerous. That, alone, would not be hard for the kernel to deal with; the slab allocators are highly tuned for the efficient allocation and freeing of small objects. But BPF programs, being executable code, must be stored in memory that allows execute access, and that complicates the picture.

Any memory that is both executable and writable presents an attractive target for attackers, so the kernel goes well out of its way to prevent that combination from occurring; some architectures prohibit it entirely. The kernel's own code is loaded at boot time, made read-only, and (usually) never changed again. Loadable modules, which require the addition of kernel code at run time, complicate things a bit, but modules are relatively large and tend to be stable. The kernel will load the modules it needs shortly after boot, and the set of loaded modules will rarely change thereafter. As a result, even if the handling of loadable modules is not optimal, things normally work well enough anyway.

As noted above, though, BPF programs can come and go frequently, and there can be a lot of small programs in the system. All of this would be fine in the absence of the prohibition on memory that is both writable and executable; that restriction requires that memory holding BPF programs, which are executable, be made non-writable. That, in turn, requires changing the permissions in the "direct map", the range of kernel address space that (on 64-bit systems) maps all of the system's physical memory. Even if direct-map addresses are not used to access BPF memory (as would happen if the vmalloc() family of allocators is used to obtain it), the existence of a writable direct mapping to executable code would create a potential vulnerability.

The kernel's direct map uses huge-page mappings (of 1GB size when possible). Huge-page mappings reduce the pressure on the system's translation lookaside buffer (TLB) and improve the performance of the system overall. If a portion of the direct map must be made read-only, though, then the huge page that contains it must be split into smaller pages, fragmenting the direct map with a measurable impact on performance. Doing that once or twice might not be a big problem but, in a system where BPF programs come and go frequently, the impact on the direct map can be severe.
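The cost of a single split can be put in rough numbers. The sketch below is illustrative arithmetic only, not kernel code: it assumes x86-64-style page tables with 512 entries per level (an assumption; the text above does not name an architecture), and the function name is hypothetical. It counts how many leaf mappings replace one 1GB mapping once a single 4KB page inside it needs its own permissions.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of leaf mappings covering a region that
 * started as ONE 1GB mapping, after one 4KB page inside it has been
 * given different permissions. Assumes each page-table level holds
 * entries_per_table entries (512 on x86-64).
 */
unsigned long leaf_mappings_after_split(unsigned long entries_per_table)
{
    assert(entries_per_table >= 1);

    /* The 1GB entry becomes a table of 2MB entries; all but one stay huge. */
    unsigned long still_2mb = entries_per_table - 1;

    /* The remaining 2MB entry becomes a table of 4KB entries. */
    unsigned long now_4kb = entries_per_table;

    return still_2mb + now_4kb;
}
```

Under that assumption, leaf_mappings_after_split(512) yields 1023: one TLB-friendly mapping has turned into more than a thousand.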

The smallness of BPF programs also turns out to be a problem. In older kernels, each BPF program was loaded into its own (4KB) page, meaning that, often, most of the page was wasted. If many of these programs are loaded, that wasted memory starts to add up.

In February, Liu set out to solve these problems. The "bpf_prog_pack" allocator worked by allocating 2MB huge pages from the kernel, then handing out portions of those pages for BPF programs as they are loaded. The concentration of multiple BPF programs into huge pages addressed both problems: it minimized fragmentation of the direct map and reduced memory waste by packing BPF programs together in the same page. This allocator looked like a good solution and was quickly pulled into the mainline during the 5.18 merge window.
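The packing idea can be sketched in a few lines of user-space C. This is a minimal illustration, not the bpf_prog_pack code: the 64-byte chunk size, the one-byte-per-chunk usage map, and the names pack_alloc() and pack_free() are all assumptions made for the example. It hands out offsets within one 2MB region using a first-fit search, so that many small programs share a single huge page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PACK_SIZE  (2UL * 1024 * 1024)  /* one 2MB huge page */
#define CHUNK_SIZE 64UL                 /* assumed allocation granule */
#define NCHUNKS    (PACK_SIZE / CHUNK_SIZE)

/* One byte per chunk for clarity; a real allocator would use a bitmap. */
struct pack {
    unsigned char used[NCHUNKS];
};

/* Round a byte count up to a whole number of chunks. */
static size_t chunks_for(size_t size)
{
    return (size + CHUNK_SIZE - 1) / CHUNK_SIZE;
}

/* First-fit: return the byte offset of a free run, or -1 if none fits. */
long pack_alloc(struct pack *p, size_t size)
{
    size_t need = chunks_for(size);
    size_t run = 0;

    if (need == 0 || need > NCHUNKS)
        return -1;
    for (size_t i = 0; i < NCHUNKS; i++) {
        run = p->used[i] ? 0 : run + 1;
        if (run == need) {
            size_t start = i + 1 - need;
            memset(&p->used[start], 1, need);
            return (long)(start * CHUNK_SIZE);
        }
    }
    return -1;
}

/* Release a range previously returned by pack_alloc() for the same size. */
void pack_free(struct pack *p, long offset, size_t size)
{
    assert(offset >= 0 && (unsigned long)offset % CHUNK_SIZE == 0);
    memset(&p->used[(unsigned long)offset / CHUNK_SIZE], 0, chunks_for(size));
}
```

In this sketch a 100-byte program occupies two chunks (128 bytes) rather than a whole 4KB page, and the permissions of the enclosing huge page never need to change as programs come and go.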

Unfortunately, a number of problems quickly surfaced, and much of the bpf_prog_pack functionality was backed out despite the fact that the source of some of the trouble was to be found in the memory-management subsystem. The allocator is still present in the kernel, but it uses 4KB "base" pages, so it does not help performance as much as it could.

Trying again

Liu's new proposal replaces bpf_prog_pack with a new allocator that addresses the complaints about the previous version and, once again, uses huge pages to hold BPF programs. That leads to improved performance:

Based on our experiments, we measured 0.5% performance improvement from bpf_prog_pack. This patchset further boosts the improvement to 0.7%. The difference is because bpf_prog_pack uses 512x 4kB pages instead of 1x 2MB page.

The use of 2MB pages is now possible as the result of fixing the related problems in the memory-management subsystem. This new allocator goes beyond the use of huge pages, though, and creates a new API for the management of transient, executable code in the kernel:

    void *execmem_alloc(unsigned long size, unsigned long align);
    void *execmem_fill(void *dst, void *src, size_t len);
    void execmem_free(void *addr);

Any kernel subsystem that needs to set up a segment of executable code can allocate the memory with execmem_alloc(). The memory that is returned will have read-only protection, so the caller cannot copy the code into it directly. Instead, execmem_fill() must be called to populate this memory with the executable text. On the x86 architecture (the only one that supports this mechanism now), the "text_poke" machinery will be used to safely copy the code while dodging the many race conditions that can present themselves when code is being modified. If a range of executable memory is no longer needed, it can be returned with execmem_free().
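The allocate, fill, free pattern can be mimicked in user space to show why a separate fill step is needed. The following is a loose analogy under stated assumptions, not the kernel implementation: the demo_execmem_*() names are hypothetical, the alignment argument is dropped, the return value of the fill function (the destination on success, NULL on failure) is a guess, and mprotect() stands in for text_poke, which works quite differently. The memory is mapped read-only rather than read-and-execute to keep the demo portable.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t page_size(void)
{
    return (size_t)sysconf(_SC_PAGESIZE);
}

/* Hypothetical stand-in for execmem_alloc(): returns read-only memory. */
void *demo_execmem_alloc(size_t size)
{
    size_t pg = page_size();
    size_t body = (size + pg - 1) / pg * pg;
    /* A leading page records the mapping length for demo_execmem_free(). */
    unsigned char *base = mmap(NULL, pg + body, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (base == MAP_FAILED)
        return NULL;
    *(size_t *)base = pg + body;
    if (mprotect(base, pg + body, PROT_READ) != 0) {
        munmap(base, pg + body);
        return NULL;
    }
    return base + pg;
}

/* Hypothetical stand-in for execmem_fill(): the only way to write. */
void *demo_execmem_fill(void *dst, const void *src, size_t len)
{
    size_t pg = page_size();
    uintptr_t start = (uintptr_t)dst & ~(uintptr_t)(pg - 1);
    size_t span = ((uintptr_t)dst + len - start + pg - 1) / pg * pg;

    /* Briefly open a write window, copy, then close it again. */
    if (mprotect((void *)start, span, PROT_READ | PROT_WRITE) != 0)
        return NULL;
    memcpy(dst, src, len);
    if (mprotect((void *)start, span, PROT_READ) != 0)
        return NULL;
    return dst;
}

/* Hypothetical stand-in for execmem_free(). */
void demo_execmem_free(void *addr)
{
    assert(addr != NULL);
    unsigned char *base = (unsigned char *)addr - page_size();
    munmap(base, *(const size_t *)base);
}
```

A caller would allocate with demo_execmem_alloc(), hand its code bytes to demo_execmem_fill(), and eventually call demo_execmem_free(); a direct memcpy() into the returned pointer would fault, which is the property the real API is meant to enforce.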

The advantage of this new API is that it is not limited to just BPF programs; it can also potentially be used in other places where code is loaded into the kernel — specifically for loadable modules. That would improve the efficiency of those allocations while simultaneously reducing the number of code-loading implementations in the kernel. That seems like a significant benefit, but there is just one little problem: the module loader has not been changed to actually use this API, so there is no proof that it will work in that context.

Indeed, it almost certainly will not work for module loading yet, simply because there is no support for any architectures other than x86. Loading code into a running kernel is a tricky business, and the details of how it can be done safely vary widely from one architecture to the next. A number of architectures now implement at least parts of the text_poke API, which simplifies the task, but text_poke is not universal; arm64 does not support it, for example. Architectures also have differing requirements around the placement of data areas for modules; it may not work to put a module's BSS memory far away from its text, for example. All of this adds up to a number of potential headaches for anybody trying to actually use the new API for module loading.

Reviewers of this work would, understandably, like some assurance that the new API can work beyond BPF before accepting it; Mike Rapoport, for example, has asked for "at least some explanation how modules etc can use execmem_ APIs without breaking !x86 architectures". Rick Edgecombe responded with an assertion that other architectures could be supported with minor changes to the API, but questioned whether it is truly necessary to solve the whole problem at this point.

Luis Chamberlain has also expressed frustration at the lack of solid (and reproducible) data showing how this work improves system performance. He clearly sees some advantages overall, though, since one of his complaints is that the patch changelogs do not sufficiently highlight "the gains of this effort helping with the long term advantage of centralizing the semantics for permissions on memory". Liu has responded with a bit more data on TLB-miss improvement.

The benefits of the work seem clear, should it manage to not run into surprises like its predecessor. The biggest question with regard to merging would seem to be just how much work will be required to convince reviewers that this API can handle the module case. If a complete solution is required, the new BPF program allocator seems unlikely to land anytime soon. Since there are no user-space API issues to resolve, though, it should be possible to proceed with the BPF solution once reviewers are convinced that it does not actively lead in the wrong direction.

Index entries for this article
Kernel	BPF/Memory management

Update about the status of the patch

Posted Dec 2, 2022 13:27 UTC (Fri) by jejb (subscriber, #6654) [Link]

Just for people who find this via google, the debate has now evolved to the point where tglx has NACK'd this patch without a solution to the broader executable memory allocation that would cover modules and more:

https://lore.kernel.org/linux-mm/871qpluxfu.ffs@tglx/

So more patches to address the underlying issue of executable memory allocation may be expected.

Yet another try at the BPF program allocator

Posted Jan 5, 2024 20:58 UTC (Fri) by bland (guest, #168935) [Link]

Note that BPF on older ARM kernels does run into odd allocation-location behaviors due to the use of the kernel's module_alloc infrastructure, which makes it tough to write kernel modules around. Thankfully, recent efforts by Mark Rutland on v6.4-rc3 have led to memory allocation locations for BPF programs on ARM that kernel modules can make sense of!