Leading items
Welcome to the LWN.net Weekly Edition for July 13, 2023
This edition contains the following feature content:
- Large folios for anonymous memory: there are advantages to enabling larger allocations for anonymous memory, but some challenges remain.
- A pair of workqueue improvements: two improvements to the workqueue mechanism, one of which made it into 6.5 and one of which did not.
- The rest of the 6.5 merge window: a summary of the most interesting changes pulled in the latter half of this merge window.
- The last installment of LSFMM+BPF coverage:
- BPF iterators for filesystems: a way to use BPF iterators to retrieve various kinds of filesystem information.
- The FUSE BPF filesystem: a FUSE implementation that uses BPF to reduce the number of transitions in and out of the kernel.
- Testing for storage and filesystems: what can be done to improve the kernel's infrastructure for the testing of storage and filesystem changes?
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Large folios for anonymous memory
The transition to folios has transformed the memory-management subsystem in a number of ways, but has also resulted in a lot of code churn that has not been welcomed by all developers. As this work proceeds, though, some of the benefits from it are beginning to become clear. One example may well be in the handling of anonymous memory, as can be seen in a pair of patch sets from Ryan Roberts.
The initial Linux kernel release used 4KB pages on systems whose total memory size was measured in megabytes — and a rather small number of megabytes at that. Since then, installed-memory sizes have grown by a few orders of magnitude, but the 4KB page size remains. So the kernel has to manage far more pages than it once did; that leads to more memory used for tracking, longer lists to scan, and more page faults to handle. In many ways, a 4KB page size is far too small for contemporary systems.
Some architectures support running with larger page sizes, and any system could emulate larger pages by clustering the existing base pages. But a larger page size has its own problem: internal fragmentation that can waste a significant amount of memory. In practice, this problem has been severe enough to keep 4KB pages around, despite their drawbacks.
One of the key advantages that folios bring is that they make the system's base page size less important; a folio can have any size (as long as it is a power of two) and kernel code will do the right thing with it. That allows different sizes to be used in different settings, as appropriate.
Anonymous folios
Roberts's large folios for anonymous memory patch set takes advantage of this flexibility to improve the management of anonymous pages — pages associated with program data and not backed by a file on disk. At its core, the change is simple; whenever the kernel is called upon to map a page of anonymous memory for a process, it tries to allocate and map a larger folio instead. By default, the code will try to allocate and map a 64KB folio, dropping down to smaller sizes if that cannot be done; there is a hook that allows architecture-specific code to specify a different default.
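In rough terms, the fault path tries the largest suitable order first and works downward; here is a minimal sketch of that policy, assuming a hypothetical anon_folio_fits() helper for the VMA-boundary and overlap checks (the actual patches structure this logic differently):

    /*
     * Illustrative sketch only; anon_folio_fits() is hypothetical and
     * the real patch set organizes this logic differently.
     */
    static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
                                          unsigned long addr)
    {
        struct folio *folio;
        int order;

        /* Start at the preferred order (64KB => order 4 with 4KB pages). */
        for (order = arch_wants_pte_order(vma); order > 0; order--) {
            /* Skip orders that would cross the VMA boundary or
             * overlap already-mapped pages. */
            if (!anon_folio_fits(vma, addr, order))
                continue;
            folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order,
                                    vma, addr, true);
            if (folio)
                return folio;   /* one fault maps the whole folio */
        }

        /* Fall back to a single base page. */
        return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr, false);
    }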
Since anonymous memory starts out filled with zeroes, mapping it in larger chunks is not particularly hard; there is no extra I/O that must be done. The biggest advantage for the kernel is that mapping large folios can significantly reduce the number of page faults that must be handled. If a single fault results in the mapping of a 64KB folio, that memory range can be accessed with just that one fault, rather than the 16 that would otherwise be required when mapping 4KB base pages.
Of course, it is not always possible to map a larger folio in that way. If a physically contiguous chunk of memory with a suitable size and alignment is not available, then the attempt will fail. It is also not possible to map a folio that extends beyond the virtual memory area (VMA) that contains it. If there are pages already mapped in a part of the address range that a folio would cover then, once again, that folio cannot be used. The ability to transparently drop down to smaller sizes means that the kernel can use an allocation that is suited to the conditions it finds at the time. Among other things, that helps to avoid internal fragmentation with small mappings.
Benchmarks running the most important workload of all — compiling the kernel — show an approximately 5% reduction in the time needed, with a reduction in kernel time of about 40%. That, alone, suggests that this work may be a good idea, but there are more gains to be had on current hardware.
Reducing TLB pressure
Virtual-address translation is a complicated process; it involves stepping through three to five levels of page tables, perhaps incurring cache misses at each step. The CPU tries to avoid this expense whenever possible by maintaining a cache of recent translations in the translation lookaside buffer (TLB). To a surprising extent, an application's performance will be determined by how well it fits into the TLB; a lot of TLB misses will slow things considerably. Unfortunately, TLB memory is expensive, so the cache is never as big as one might like it to be.
One important technique for stretching the TLB is the use of huge pages, which can allow 2MB (or even 1GB) of memory to be covered by a single TLB entry. Huge pages are, however, huge; they can be difficult to allocate on a busy system and can create huge internal-fragmentation problems of their own. The smaller folios used by Roberts's patch are much easier to manage, but they don't provide the same TLB benefits that huge pages do.
Or, at least, that was once the case. More recent CPUs have started adding a bit to their page-table entries to indicate that a small range of pages has been placed in physically contiguous memory. The processor can use that information to collapse the TLB entries referring to those pages into a single entry; the benefit is not as large as with a full huge page, but it is also much easier to obtain. This benefit will only be enjoyed, though, if the kernel sets the "contiguous PTE" bit where the mapping is truly contiguous.
The second patch set from Roberts does exactly that, for the arm64 architecture at least. In an amazing coincidence, arm64 systems can map a contiguous range of up to 64KB — which just happens to be the default folio size set for arm64 in the first patch series — into a single TLB entry. With this series applied, contiguous ranges of pages are detected automatically, and the appropriate page-table bits will be set. That results in another 2% gain for the kernel-compilation benchmark.
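A hedged sketch of what that looks like on arm64: pte_mkcont() and CONT_PTES are existing arm64 definitions, while the function name and the use of PAGE_KERNEL protections are purely illustrative (the real patches hook this logic into the PTE-setting paths):

    /* Mark a naturally aligned, physically contiguous run of CONT_PTES
     * (16) entries with the contiguous hint; the CPU may then collapse
     * their translations into a single TLB entry. */
    static void fold_contpte_range(pte_t *ptep, unsigned long pfn)
    {
        int i;

        for (i = 0; i < CONT_PTES; i++)
            set_pte(ptep + i, pte_mkcont(pfn_pte(pfn + i, PAGE_KERNEL)));
    }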
Discussion
These gains will only happen if this code is merged into the mainline kernel, though. That seems likely to happen, but there will be some changes needed first. For example, Yu Zhao has complained about the architecture-specific function to set the default folio size. That function takes the faulting VMA as a parameter; Zhao feels that the result is a mixture of architecture-specific decision making with policy that should be managed by the core memory-management code. Roberts has indicated that he is willing to change that interface.
Zhao also dislikes the practice of trying intermediate sizes if the desired folio size cannot be used; the work, he said, would "be a lot easier to sell" if it fell back immediately to the base-page size. As was explained in the anonymous-folio cover letter, Zhao has recommended this change in the past, and Roberts tried it; the result was worse performance on some benchmarks, so Roberts seems less willing to give on this point. When asked, Zhao gave three reasons for his dislike of the intermediate fallback, with the most significant being that it may cause system-wide fragmentation:
The third one is why I really don't like this best-fit policy. By falling back to smaller orders, we can waste a limited number of physically contiguous pages on wrong vmas (small vmas only), leading to failures to serve large vmas which otherwise would have a higher overall ROI.
A possible compromise would be to attempt a single fallback to PAGE_ALLOC_COSTLY_ORDER (order 3, or 32KB with 4KB base pages) before giving up and mapping base pages. In other words, this policy would avoid allocating relatively small (but still larger than single-page) folios that might break up larger, physically contiguous ranges of memory.
Another concern is that this work — and the benchmarking that comes with it — is all specific to the arm64 architecture. Support for physically contiguous page-table entries is coming to x86 processors as well, so this feature will eventually need to work beyond arm64. That suggests that a favorable review from the x86 community will be a necessary precondition to getting this work merged. Intel developer Yin Fengwei has been participating in the discussion and has indicated that some, but not all, of the patches seem ready.
The biggest stumbling block, though, may be that large anonymous folios are not yet fully integrated into the rest of the memory-management subsystem. As mentioned in one changelog in the series:
The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which defaults to disabled for now; there is a long list of todos to make FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some madvise ops, etc). These items will be tackled in subsequent patches.
Roberts has posted a more detailed list of things that need to be fixed and indicated that he would prefer to merge the feature, disabled by default, and deal with the remaining problems afterward. But, as Matthew Wilcox pointed out, there will be little desire to merge a patch set that still has that kind of outstanding issue, so these problems will almost certainly need to be worked out before this feature can be considered ready for the mainline.
This work suggests that the debate over whether the kernel's page size should be increased is over; the answer is to use the size that works best in each situation rather than using a single page size everywhere. The folio work has given the kernel some of the flexibility needed to adopt a policy like that. There is a gap, though, between the ability to implement such a feature and creating a feature that can be deployed in production kernels. Future kernels will almost certainly be capable of mapping variably sized anonymous folios, but getting to that point may take a while yet.
A pair of workqueue improvements
Over the years, the kernel has developed a number of deferred-execution mechanisms to take care of work that cannot be done immediately. For many (or most) needs, the workqueue subsystem is the tool that developers reach for first. Workqueues took their current form over a dozen years ago, but that does not mean that there are not improvements to be made. Two sets of patches from Tejun Heo show the pressures being felt by the workqueue subsystem and the solutions that are being tried — with varying degrees of success.
In normal usage, each subsystem creates its own workqueue (with alloc_workqueue()) to hold work items. When kernel code needs to defer a task, it can fill in a work_struct structure with the address of a function to call and some data to pass to that call. That structure can be passed, along with the target workqueue, to a function like queue_work(), and the workqueue mechanism will call the function at some future time. The call is made in process context, meaning that work items can block if need be. There is, of course, a long list of variants to queue_work(), and a number of ways in which workqueues themselves can be created, but the core functionality — call a function in process context at a later time — remains the same.
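The basic pattern, for reference, looks like this minimal sketch using the stock API:

    #include <linux/slab.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *my_wq;  /* alloc_workqueue("my_wq", 0, 0) at init */

    struct my_item {
        struct work_struct work;
        int payload;                        /* data for the deferred call */
    };

    static void my_work_fn(struct work_struct *work)
    {
        /* Runs later, in process context, so it may sleep. */
        struct my_item *item = container_of(work, struct my_item, work);

        pr_info("deferred work: payload=%d\n", item->payload);
        kfree(item);
    }

    static int defer_something(int payload)
    {
        struct my_item *item = kmalloc(sizeof(*item), GFP_KERNEL);

        if (!item)
            return -ENOMEM;
        item->payload = payload;
        INIT_WORK(&item->work, my_work_fn);
        queue_work(my_wq, &item->work);     /* my_work_fn() will be called later */
        return 0;
    }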
Once upon a time, each workqueue had one or more kernel threads associated with it. As long as there were work items on a queue, the threads would remove and execute those items. The problem with this implementation is that the kernel contains a large number of workqueues, and they can end up processing a lot of work items. That resulted in systems containing many worker threads, all competing with each other.
The "concurrency-managed workqueue" mechanism found in current kernels, also created by Heo, was designed to address these problems. Workqueues no longer have dedicated kernel threads associated with them; instead, a globally managed set of threads runs items from all workqueues. The workqueue subsystem tries to have exactly one work item running on each CPU at any given time — if, of course, there are that many items in need of execution. Once one work item completes, another is dispatched in its place.
There is one other complication: since work items are allowed to block, any given work item could be "running" but not actually runnable for long periods of time; that could result in a CPU going idle while there are other work items waiting to be run. The workqueue mechanism handles this case by arranging to be notified whenever one of its worker items blocks. When that happens, another work item will be dispatched (with another thread created to run it, if needed) so that the CPU remains busy. Once the blocked worker wakes up, the workqueue core will notice and stop dispatching items while that worker runs.
Detecting CPU-intensive workers
This mechanism handles the case where a work item blocks, but there is another potentially problematic case. If a work item runs for a long time, it will block any others from running on the same CPU, leading to the starvation of other work items. There is a flag (WQ_CPU_INTENSIVE) that can be set when a workqueue is created to indicate that the work items running there may run for a long time; that flag causes the workqueue to be run outside of the normal concurrency-management mechanism so that it doesn't block other workers. Developers are often surprised by which parts of their code take the most CPU time, though, so it is easy to not remember to set that flag when creating a workqueue. As a result, kernel developers occasionally find themselves tracking down performance problems created by CPU-hog work items.
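When a developer does know that a queue's work items are CPU hogs, marking the queue is a one-line matter at creation time:

    struct workqueue_struct *wq;

    /* Work items on this queue may hold a CPU for a long time; run
     * them outside of the concurrency-management regime so that they
     * cannot starve other work items. */
    wq = alloc_workqueue("my_cpu_hog", WQ_CPU_INTENSIVE, 0);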
This patch set provides a relatively simple solution to this problem. The workqueue core will, on a regular basis, look at which workers are running on each CPU. If any given worker is found to have run without blocking for a long time (defined as 10ms by default), it will be marked as being CPU-intensive and taken out of the concurrency-management regime, allowing other workers to run. There is an option to have the kernel report work functions that are repeatedly marked in this way; developers can use that information to mark the workqueues from which they are run appropriately.
This new machinery also makes it relatively easy to track how much CPU time each work item is using. This information has been made available to user space, allowing developers to see how much time their workqueues are consuming.
This work was pulled into the mainline during the 6.5 merge window.
Binding unbound workqueues
The discussion to this point has ignored the existence of unbound workqueues, which are created with the WQ_UNBOUND flag. These workqueues are documented as running "workers which are not bound to any specific CPU", and they are not part of the above-described concurrency-management regime. Unbound workqueues are described as being suited for CPU-intensive tasks that are better managed by the CPU scheduler.
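Creating such a queue uses the same allocation call; unbound queues can also be given attributes, as in this sketch (which assumes the existing attrs helpers are accessible to the caller; the affinity-scope knobs discussed below are not part of this interface):

    struct workqueue_struct *wq;
    struct workqueue_attrs *attrs;

    /* Work items are handed to the scheduler rather than being tied
     * to the submitting CPU. */
    wq = alloc_workqueue("my_unbound", WQ_UNBOUND, 0);

    /* Existing attributes include a nice level and an allowed cpumask. */
    attrs = alloc_workqueue_attrs();
    if (attrs) {
        attrs->nice = -5;       /* run these workers at elevated priority */
        apply_workqueue_attrs(wq, attrs);
        free_workqueue_attrs(attrs);
    }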
In practice, as described in this patch set, unbound workqueues have not been fully unbound for a while; instead, the workqueue mechanism tries to contain them within a NUMA node. That increases the locality of the workqueue, improving performance. However, it seems that, on current CPUs, NUMA affinity is not enough. A single NUMA node might now contain multiple L3 caches; spreading work across a node will thus spread it across multiple caches, losing some of the locality that NUMA affinity was meant to produce. This has led to a number of complaints about workqueue performance.
Fixing this problem, it seems, is not easy; Heo has concluded that "there is not going to be one good enough behavior for everyone". So, instead, the patch set creates three new parameters that can be set for each workqueue:
- The "affinity scope", describing the boundaries that should be applied to an unbound workqueue. There are five possible values, binding queues to a single CPU, to a CPU and its siblings, to all CPUs sharing the same L3 cache, to a NUMA node, or to the system as a whole. The NUMA binding matches current workqueue behavior.
- The "affinity strictness": how strongly the workqueue is bound to its given scope. With strict affinity, work items cannot run outside of their scope. With non-strict affinity, work items will be started within their scope, but the scheduler will be able to move them outside if that improves the performance of the system overall.
- "Localization": if set, work items are always started on the CPU where they were queued; after that, they can be moved as described by the scope and strictness parameters.
Heo included some benchmarks showing the effects of various combinations of parameters. Changing the localization parameter proved not to be helpful, and he suggested that it may eventually be dropped from the series. The others gave some small gains depending on the specific workload being run. The overall picture is less than fully clear or, as Heo put it: "The tradeoff between efficiency gains from improved locality and bandwidth loss from work-conservation deficit poses a conundrum".
Linus Torvalds initially responded that this work looked overly focused on throughput while ignoring latency, which he regards as being more important. He was later convinced by Heo, though, that this work could improve both throughput and latency. Brian Norris, who is one of the developers reporting performance problems with current kernels, tried the changes but saw no performance improvement — results that Heo found mystifying. Torvalds suggested that the problem might be a bug elsewhere in the workqueue code.
As of this writing, these workqueue performance problems remain unresolved. It is thus not surprising that this set of patches was not pushed for the 6.5 release. Developers are going to have to dig deeper to figure out why some current system architectures are creating performance problems for workqueues.
The rest of the 6.5 merge window
Linus Torvalds released 6.5-rc1 and closed the merge window for this development cycle on July 9. By that point, 11,730 non-merge changesets had been pulled into the mainline for 6.5; over 7,700 of those were pulled after the first-half merge-window summary was written. The second half of the merge window saw a lot of code coming into the mainline and a long list of significant changes.
The most interesting changes pulled in the latter part of the merge window include:
Architecture-specific
- The LoongArch architecture has gained support for simultaneous multi-threading (SMT) and for building with the Clang compiler.
- RISC-V now supports ACPI and the Vector extension.
Core kernel
- The function-graph tracer can now record and report the return value from functions; this documentation commit describes how to use this feature.
- The timer-latency tracer can now be controlled and queried from user space; this commit contains some relevant documentation.
- "fprobe events" are a new mechanism for tracing function entry and
exit that is better supported on more architectures; this commit has
some more information and this one has a
new document.
These events can also be used to easily trace raw tracepoints that lack a trace-event declaration. Raw tracepoints were created to make it harder to get at deeply internal kernel features, thus making it less likely that user space would come to rely on them; fprobe events are now making it easier again.
Filesystems and block I/O
- The overlay filesystem has gained support for data-only layers, which is needed for composefs and similar use cases. This commit contains a small documentation update.
- Overlayfs has also been ported to the new mount API.
- The F2FS filesystem has a new errors= mount option to control the response to media errors; this commit contains some more information.
Hardware support
- Clock: Amlogic A1 SoC PLL controllers, Amlogic A1 SoC peripherals clock controllers, Nuvoton MA35D1 clock controllers, Qualcomm SM8350, SM8450, and SM8550 video clock controllers, Qualcomm SDX75 global clock controllers, Qualcomm SM8450 graphics clock controllers, and Qualcomm SC8280 low power audio subsystem clock controllers.
- GPIO and pin control: TI TPS65219 GPIO controllers, Mellanox BlueField 3 SoC GPIO controllers, STMicroelectronics STM32MP257 pin controllers, Qualcomm IPQ5018, SDX65 and SDX75 pin controllers, and NVIDIA Tegra234 pin controllers.
- Graphics: Samsung S6D7AA0 MIPI-DSI video mode panel controllers and Amlogic Meson MIPI DSI Synopsys controllers.
- Hardware monitoring: HP WMI sensors.
- Industrial I/O: Texas Instruments OPT4001 light sensors, Honeywell MPRLS0025PA pressure sensors, and ROHM BU27008 color sensors.
- Input: NVIDIA SHIELD devices.
- Miscellaneous: StarFive JH7110 cryptographic engines, CXL 3.0 performance monitoring units, Richtek RT5033 battery chargers, Analog Devices MAX77541/77540 power-management ICs, Intel Cherry Trail Whiskey Cove LED controllers, Awinic AW20036/AW20054/AW20072 LED drivers, TI TPS6594 error signal monitors, TI TPS6594 pre-configurable finite state machines, Nuvoton MA35D1 family UARTs, AMD/Pensando DSC vDPA interfaces, Qualcomm PMI8998 PMIC chargers, OmniVision OV01A10 sensors, Microchip corePWM pulse-width modulators, Renesas RZ/G2L MTU3a PWM timers, Qualcomm DWMAC SGMII SerDes PHYs, and Xilinx window watchdog timers.
- Sound: The kernel's sound subsystem has gained support for MIDI 2.0 devices, along with Realtek RT722 SDCA codecs, Analog Devices SSM3515 amplifiers, Google Chameleon v3 codecs, Google Chameleon v3 I2S interfaces, StarFive JH7110 TDM devices, Loongson I2S controllers, Loongson sound cards, Analog Devices MAX98388 speaker amplifiers, Texas Instruments TAS2781 speaker amplifiers, and Qualcomm WSA8840/WSA8845/WSA8845H class-D speaker amplifiers.
- USB: Qualcomm PMIC USB Type-C controllers, Cadence USB3 controllers, Cadence USBHS device controllers, and On Semiconductor NB7VPQ904M Type-C redrivers.
Miscellaneous
- This merge message and this one describe the (many) enhancements made to the perf tool in this cycle.
Internal kernel changes
- The scope-based resource management patches have been merged, but there are unlikely to be any uses of this mechanism in the 6.5 release. Torvalds said:
However, let's agree to not really use it for 6.5 yet, and consider it all purely infrastructure for the next release, and for testing it all out in linux-next etc.
We should probably also strive to avoid it for bug-fixes that end up going to stable. I'm sure this will all be backported to stable eventually, but I'd at least personally be happier if that started happening only after we actually have some more interaction with this.
- It is common knowledge that Linus Torvalds does not write much code anymore. That doesn't keep him from getting his hands dirty on occasion, though, as the merging of his "expand stack" series shows. The adoption of the maple-tree data structure complicated the task of expanding the user-space process stack, breaking the locking (or lack thereof) that had been in use and creating the "StackRot" vulnerability. So Torvalds dug in, unified much of the architecture-specific page-fault code, and fixed the problem.
One of the cases that the new code does not handle is expanding the stack in response to a get_user_pages() call — something that Torvalds does not think should ever happen. There is a warning in place for the rest of the development cycle to sound the alarm if that assumption turns out to be wrong.
- The 32-bit Arm devicetree files have been massively reorganized to more closely match the 64-bit files. "The impact of this will be that all external patches no longer apply, and anything depending on the location of the dtb files in the build directory will have to change."
- Following the LSFMM+BPF discussion, the SLAB allocator has been deprecated and will, barring problems, be removed in a future development cycle.
- The new SLAB_NO_MERGE flag will prevent the slab allocators from merging a cache with (otherwise) compatible caches; a usage sketch appears after this list.
- There were 371 exported symbols added during this merge window and 94 removed; see this page for a full list. Two kfuncs (bpf_cpumask_any() and bpf_cpumask_any_and()) were removed, and five (bpf_cpumask_any_and_distribute(), bpf_cpumask_any_distribute(), bpf_cpumask_first_and(), bpf_sock_destroy(), and bpf_task_under_cgroup()) were added.
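As an illustration of the SLAB_NO_MERGE flag mentioned above, a minimal sketch (the cache name and object size are arbitrary):

    #include <linux/slab.h>

    static struct kmem_cache *my_cache;

    static int __init my_cache_init(void)
    {
        /* SLAB_NO_MERGE keeps this cache separate even if another
         * cache with compatible parameters exists; useful when a
         * cache must be isolated for debugging or hardening. */
        my_cache = kmem_cache_create("my_objects", 192, 0,
                                     SLAB_NO_MERGE, NULL);
        return my_cache ? 0 : -ENOMEM;
    }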
The most significant change that was not merged was bcachefs, despite its author having sent a pull request at the beginning of the merge window. The resulting thread showed that, while quite a few developers seem to want this code merged into the mainline, there are still a number of outstanding issues that need to be addressed and, to put it gently, a certain amount of tension with much of the development community. There is a reasonable chance that all of this will be worked out in time for 6.6, but it may be a noisy process.
Meanwhile, the next seven or eight weeks will be dedicated to the stabilization of all of this new code, with the final 6.5 release expected on either August 27 or September 3.
BPF iterators for filesystems
In the first of two combined BPF and filesystem sessions at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Hou Tao introduced his BPF iterators for filesystem information. Iterators for BPF are a relatively recent addition to the BPF landscape; they help BPF programs step through kernel data structures in a loop-like manner, but without running afoul of the BPF verifier, which is notoriously hard to convince about loops.
In his remote presentation, Tao began with a quick overview of BPF iterators. They allow users to write a special type of BPF program that can step through kernel data structures in ways that would normally be handled with loops; instead, the BPF program contains callbacks that are made from the kernel in response to user-space reads of pinned BPF files. The callback is made for each new kernel object encountered in the data structure; the code in the callback can then present information from the object to user space in whatever format the developer wants.
As described in his LSFMM+BPF topic proposal, Tao envisions BPF iterators being used for gathering mount and filesystem information, which was the topic of a session on the previous day. The RFC patch set he posted a few days prior to the summit contains two iterators to extract information from specific inodes or mounts; it also has tests to exercise the iterators.
He then briefly described the task-file iterator, which is a BPF selftest shown in the kernel documentation for BPF iterators. A user-space program can load the BPF program containing the iterator, start the iterator with the ID of the task of interest, then use the iterator's file descriptor to read some information for the given task's open files.
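Such an iterator is a small BPF program; this sketch is modeled on the kernel's task-file selftest (the context fields shown — meta, task, fd, and file — are those of the in-tree bpf_iter__task_file structure):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    /* Called once for each (task, file) pair; file is NULL at the end
     * of the iteration. */
    SEC("iter/task_file")
    int dump_task_file(struct bpf_iter__task_file *ctx)
    {
        struct seq_file *seq = ctx->meta->seq;
        struct task_struct *task = ctx->task;
        struct file *file = ctx->file;

        if (!task || !file)
            return 0;

        /* Whatever is printed here is what user space reads from the
         * iterator's file descriptor. */
        BPF_SEQ_PRINTF(seq, "pid=%d fd=%d\n", task->pid, ctx->fd);
        return 0;
    }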
His idea is that BPF filesystem iterators would provide much more information than is currently available for various types of kernel objects, such as superblocks, inodes, directory entries (dentries), mounts, and so on. He envisions various use cases, including things like retrieving the order of the folios in the page cache, getting the page-cache information for files as an alternative to the proposed cachestat() system call (which was merged for 6.5), and gathering mount information in a more flexible manner than the proposed fsinfo() system call.
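For comparison, the cachestat() interface just mentioned is a conventional system call; a minimal user-space sketch, assuming 6.5 kernel headers and using the raw system-call number since glibc provided no wrapper at the time:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/mman.h>     /* struct cachestat{,_range} in 6.5 headers */

    #ifndef __NR_cachestat
    #define __NR_cachestat 451
    #endif

    int main(int argc, char **argv)
    {
        struct cachestat_range range = { .off = 0, .len = 0 }; /* len 0: whole file */
        struct cachestat cs;
        int fd;

        if (argc < 2)
            return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
            return 1;
        if (syscall(__NR_cachestat, fd, &range, &cs, 0) == 0)
            printf("cached=%llu dirty=%llu writeback=%llu\n",
                   (unsigned long long)cs.nr_cache,
                   (unsigned long long)cs.nr_dirty,
                   (unsigned long long)cs.nr_writeback);
        close(fd);
        return 0;
    }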
Christian Brauner pointed out that a BPF filesystem iterator was not going to be able to replace a new system call for gathering mount information. User-space programs may not be able to—or want to—rely on BPF for getting that information. He also has some reservations about exposing mount information to BPF, in general, due to "really intricate locking scenarios". Tao thought that a BPF helper could be provided that would do the proper locking.
The mount iterator from the patch set was up next. Aleksa Sarai asked if the intended users were kernel developers or regular user-space programmers; it looked to him like the iterator was meant for examining problems in the kernel. Tao agreed with that, noting that his examples were just trying to show what a BPF filesystem iterator could do. He also put up a slide showing his inode iterator.
After an organizer warned that time for the session was running out, Tao skipped ahead to the problems that need to be addressed. One problem is that these iterators require the CAP_BPF capability, so he wondered if an unprivileged BPF iterator would make sense. One way might be to allow regular users to access an iterator via a file pinned in the BPF filesystem; the permissions could be set by the administrator on the file to allow (or disallow) access. But, since the facility is targeted for debugging, it may be fine to only allow it for privileged users, he said.
Sarai was concerned that providing that level of detail for, say, a file's layout in memory to regular users would be detrimental from a security standpoint. He thought that even if administrators could enable it for regular users, they should not do so. He was adamant that it should not be done by default; "if an admin decides to enable this, they can deal with when someone exploits it". At that point, time ran out for the session; one of the organizers suggested that the conversation continue on the mailing list.
The FUSE BPF filesystem
The Filesystem in Userspace (FUSE) framework can be used to create a "stacked" filesystem, where the FUSE piece adds specialized functionality (e.g. reporting different file metadata) atop an underlying kernel filesystem. The performance of such filesystems leaves a lot to be desired, however, so the FUSE BPF filesystem has been proposed to try to improve the performance to be close to that of the underlying native filesystem. It came up in the context of a session on FUSE passthrough earlier in the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, but the details of FUSE BPF were more fully described by Daniel Rosenberg in a combined filesystem and BPF session on the final day of the summit.
Rosenberg said that he wanted to introduce the filesystem, describe its current status, and discuss some of the open questions with regard to future plans for it. The goal is for a stacked FUSE filesystem to come as close to the native filesystem's performance as the FUSE BPF developers can get. In addition, they want to keep "all of the nice ease-of-use of FUSE", with its "defined entry points"; the idea is to keep the interface "similar to what you would see from the FUSE daemon".
He put up a diagram showing what a classic FUSE stacked filesystem does, which requires a "transition to and from user space several times" for each operation. The application makes a system call that reaches the FUSE driver in the kernel, which calls back out to the user-space FUSE daemon (for handling the specialized functionality). The FUSE daemon then makes another system call to the lower filesystem. The difference for FUSE BPF is that the in-kernel FUSE driver can use BPF to perform any filtering or other functionality for each operation in the kernel, then call directly into the VFS layer for the lower filesystem. If the FUSE functionality cannot be performed in BPF for some reason (e.g. consulting a database is required), it can call out to the FUSE daemon for the pre- and post-filtering.
The pre- and post-filters are what provide the functionality specific to that FUSE filesystem, whatever it is. For Android, which is where FUSE BPF is being used, there are specific directories that are being hidden using the filtering. You could also imagine a filter that changes the data being read in order to hide something from the applications.
The implementation uses the BPF operation structures (struct_ops) to replace certain kernel operations with the BPF filters. Those filters mostly access structures that contain the arguments for the operation, though there is a special buffer type to contain variable-length arguments, such as strings or data buffers. The BPF programs have the option of falling back to the normal FUSE path if desired.
One of the benefits of the operation-structure approach is that the FUSE BPF filesystem only needs to provide code for the operations it wants to intercept. In a "very dumb example that you would never actually want to do", a stacked filesystem that simply adds a character to the end of all file names would only need to implement filters for the lookup and directory-read operations.
In order to use FUSE BPF, the struct_ops program needs to be registered with the system using bpftool. After that, either at mount or lookup time, the program needs to be associated with the inode or dentry of interest using a file descriptor for the backing file or directory in the lower filesystem. The developers are willing to entertain other ideas of ways to identify the backing file, but a file descriptor was easy for their use case.
Rosenberg put up a table of benchmarks that used a tmpfs RAM-based filesystem as the lower filesystem, which exaggerates the performance improvements that are seen. The benchmarks show near-native performance, but with a more complex lower filesystem the performance improvements are much smaller, he said. He asked Paul Lawrence, who had run the benchmarks, if he wanted to comment. Lawrence agreed that the tmpfs testing showed a much larger benefit from FUSE BPF versus regular FUSE than would be seen in more realistic testing.
Rosenberg said that there are some things that they are still working on. One big thing on their to-do list is to perform the operations using the credentials of the FUSE daemon. There are additional FUSE opcodes, including the ioctl() opcode, that need to be implemented. Beyond that, some of the pre- and post-filters "are not fully hooked up"; he is waiting to see if some of that needs to change before rolling it out to all of the different opcodes.
Aleksa Sarai said that io_uring had some similar problems with credentials, so he suggested looking at what those developers did as a model; he thought that it involved creating a thread from the process whose credentials should be used. But Christian Brauner said that, for the 6.4 kernel, he had merged a generic API called "user workers" for doing this kind of credential handling. Rosenberg said that he was "all for using pre-existing stuff".
Lawrence said that thread and worker-queue switching on Android leads to a huge increase in latency, to the point where it cannot be shipped. They had run into problems with dm-verity due to its worker threads, for example. But he thinks the FUSE BPF credential problem can be solved fairly simply by running the I/O in the context of the FUSE daemon. That is the normal FUSE model, so he thinks FUSE BPF should stick with it.
One attendee asked about the optional pre- and post-filters in user space; if you are going to have to pay the price to call out to user space anyway, does it make sense to just do all the processing there? Rosenberg said that one transition between kernel and user space can still be saved in that case, though there is less of a benefit. But the reason that you are doing the filtering in user space is probably because there is a lot of work that needs to be done in the filter, the attendee said. Lawrence said that in the Android use case, there are just some small pieces that need to be handled with the user-space filters and the vast majority of the file operations will use the BPF filters and stay in the kernel, saving the transition cost. He did acknowledge that there might be a better way to handle that, but that they had found it useful to have the user-space filters for Android.
Josef Bacik wanted to confirm his understanding of what features FUSE BPF would add. In particular, there are two separate pieces: adding the ability to pass operations directly to the underlying filesystem using the file-descriptor registration and the ability to attach the BPF pre- and post-filters for the operations on the upper filesystem. For filesystem-passthrough purposes, the lookup operation could open the underlying file, then associate that descriptor, and all of the rest of the operations would go directly to the underlying filesystem. Rosenberg agreed with that, noting that there would not be any BPF needed for handling passthrough.
Another attendee wondered if the only way to do the association was at lookup time; for the composefs use case, it would be better done when the file is opened. Lawrence said that Meta had already asked that FUSE BPF move away from requiring file-descriptor registration and that a path should be used instead, so that the association can be done without an open file. The plan is to change to using a path relative to a file descriptor (either of which can be null) for the association, which is the usual convention for the *at() system calls.
There was some discussion of allowing association at either lookup or open time, but Lawrence said that they have not looked into that deeply yet. It was fairly straightforward to only allow associating files at lookup time, but it may make sense to broaden that. Brauner said that adding association at open time would really complicate the code; he suggested keeping the implementation simple, at least for now.
Sarai said that it was important to use the openat2() resolve flags when the switch was made to do relative lookups. There are classic problems when resolving paths that can allow malicious programs to access parts of the filesystem that should not be allowed. If the proper resolve flags are used, that should easily eliminate those kinds of escapes.
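The flags Sarai referred to are part of the openat2() interface; a minimal sketch of a constrained lookup (there is no glibc wrapper for openat2(), so the raw syscall is used):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/openat2.h>

    /* Resolve path strictly beneath dirfd, refusing symbolic links and
     * magic /proc-style links, closing off the classic escape routes. */
    static int open_beneath(int dirfd, const char *path)
    {
        struct open_how how = {
            .flags = O_RDONLY,
            .resolve = RESOLVE_BENEATH | RESOLVE_NO_SYMLINKS |
                       RESOLVE_NO_MAGICLINKS,
        };

        return syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
    }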
Rosenberg said that, currently, there is a problem falling back to regular FUSE because the BPF code uses a FUSE node ID of zero when it creates nodes, but FUSE does not understand that value. There needs to be a way to reserve a block of node IDs for BPF to use, but it is not a problem that they have encountered, so it has been deferred. There are also two issues with the struct_ops support: there is no module support for struct_ops, so he has hacked around that, and there is a whole raft of struct_ops callbacks needed, which required him to allocate two pages of memory to hold them all. Both of those need to be cleaned up.
They have some plans for upstreaming the code. The patch set is rather large at this point, with more than 30 patches, he said; he is trying to arrange the patches to be as easy as possible to review. The current organization has the passthrough patches first, with the BPF pieces coming later in the series.
Bacik suggested that he split the series into two, with the passthrough pieces as the first. When he tried to review the current series, it was difficult to grasp that FUSE BPF is made up of two separate things. He recognizes that it is a single project from their perspective but it would help him and perhaps others to break things up. Rosenberg acknowledged that, but noted that the Android developers do not have a use for the pure passthrough version, though it is a good intermediate point.
Brauner asked why the passthrough piece was useful at all, but Amir Goldstein emphatically said "I don't understand, it's so useful". Once a file has been looked up or opened, all of its I/O can be passed through directly to the filesystem; he suggested that providing passthrough for directory operations would also be useful, but others were less sure. Miklos Szeredi agreed that file passthrough was useful, while Bacik thought that all of it was, just that it should be broken up for review purposes.
Testing for storage and filesystems
The kdevops kernel-testing framework has come up at several earlier summits, including in two separate sessions at last year's event. Testing kernel filesystems and the block layer, not to mention lots of other kernel subsystems, has become increasingly important over time. So it was no surprise that Luis Chamberlain led a combined storage and filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit to talk more about testing, the resources needed for it, and what can be done to improve it. It was the final session for this year's summit, so this article completes our coverage.
Update
Chamberlain began by saying it was time "for the boring thing that no one likes to talk about, which is testing, but we all have to do somehow". He reviewed some of the plans that came out of last year's discussions, noting that the shared repository for testing results has been set up (in the kdevops repository linked above). It is a cooperative effort; once in a while something in the shared repository gets broken and needs to be reverted, but overall it is working well. There is a Discord server ("I know some people hate that, but whatever") and there is an IRC channel on OFTC as well.
He recently found out that there is interest in storing extra information for successful runs along with the failures; there are tools that can scrape that information to do various kinds of analysis and display. Kdevops currently only creates a tarball of the failures, which can be committed to the repository; he is not sure if adding the successes to that will make the tarball too large to be stored in Git, but if so, perhaps it can go into some other kind of repository.
So far, it is only he and Chandan Babu who have been storing their results in the repository, but others can easily join in. The repository supports namespaces for testing efforts to use for their specific results. The compressed tarball contains logs and such of the failures, which can be decompressed and searched for things of interest, Chamberlain said.
There is support in kdevops for compiling kernels on fast systems using the 9p filesystem; those kernels can then be copied to guest systems where the tests are run. They encountered a few 9p bugs, but it seems to work fine with caching disabled at this point; if there are advantages to doing so, switching to NFS is possible. He said that Darrick Wong had thought that modules were not supported for kdevops kernels, but that is not the case; in addition, module signing can be used to ensure the proper modules are loaded.
The framework is being used as part of the XFS stable testing that was the subject of a session earlier that day, Chamberlain said. Both Babu and Amir Goldstein are using it for their testing as part of that effort and it is an example of kdevops being used in two different ways: on local virtual machines (VMs) and in the cloud. In fact, Babu added support for the Oracle cloud (OCI) to kdevops so that he could run his tests on that platform. The kdevops cloud support uses Terraform so other cloud providers with support for that can be added easily, Chamberlain said.
He would not be giving a kdevops demo, he said, because he wanted to discuss other things. There are demos available on YouTube already and he is happy to add others for specific workflows or other pieces as needed. He uses kdevops for day-to-day development work and not just for testing, so it can be used in multiple ways.
Kdevops is using virtio because it ran into problems with QEMU instantiating NVMe drives. It uses the IOThread feature of QEMU to avoid the global lock by allowing each drive to run independently. Without IOThread, there were lots of problems with timeouts when he and Goldstein were testing XFS. Support for IOThread on NVMe is coming, which should allow switching away from virtio, Chamberlain said.
There is initial support for testing on arm64 systems in the cloud; he needed that for testing his work on supporting large block sizes. Some systems are reporting larger block sizes, but they are generally doing so by emulating them using atomic writes. There are no local virtualization images for arm64 available, other than for openSUSE Tumbleweed, that he knows of at this time; he is not sure whether there are plans for other distributions to add them.
Resources
The main thing he wanted to discuss was the resources needed for testing and limiting the scope of the testing in order to use those resources most effectively. The non-controversial suggestion that he had was to use the MAINTAINERS file to track which test runner and which tests to use for individual filesystems and block-layer pieces; the idea is to allow the community to help with testing in ways that will be useful to the maintainers of those parts of the kernel.
But there is also a need for systems to use for automation and, of course, for people to do the work on running the tests and maintaining the test systems. His employer, Samsung, has allowed him to share the system that he uses for development with others; community developers who want to test can simply log in and do what they need to do. That has reached a point where there are times that he cannot get his work done on that system so he has to ask the other developers if he can shut some of their VMs down.
That led him to ask Samsung for an additional system, but the company asked him to see what other vendors might be willing to provide. He started that process and one vendor has provided cloud credits for use by the community. Wong came in remotely to encourage people to use the OCI free tier for their kdevops testing needs; "we provide the hardware and Luis provides the software". Jeff Layton asked if anyone could volunteer to write some documentation on using kdevops on OCI; Chamberlain said that he could do so if no one else got around to it.
He also talked to Microsoft about Azure and to Alibaba about its cloud offering, but they are still in the evaluation stage. Wong said that he had resisted using OCI because he was so accustomed to using the pet machines in his lab; once he started, it worked well. "I can spin up like 170 VMs to go run testing on several different profiles and I can run this thing every night." That all went really well until he "managed to consume all of the department's resources and now they are telling me that I need to back off a little bit". Chamberlain agreed that kdevops testing may encounter those kinds of resource limits.
He encouraged attendees to see what they could do to help procure more resources for the effort. Hannes Reinecke said that it may be difficult for companies to provide login access to their systems to anyone who is not an employee. Internally, SUSE has systems that test particular Git trees or branches automatically, then provide the results of that testing, so even kernel-developer employees do not log into those systems.
Chamberlain said that it is important that this effort be vendor-neutral; people switch jobs but still need to be able to test their work. The more resources there are available from multiple companies, the better the whole testing environment will work over the long term. If there are ways that companies can run tests for the community on specific Git trees and branches, that would definitely be useful as well.
Ted Ts'o said that an arrangement where a certain Git tree or branch was watched and tested after changes might be more palatable to companies; that way, random non-employees would not be logging in and the company could throttle the amount of testing per day that it does. Chamberlain agreed, but said that it is important for the maintainers of the components to specify the tests that they want to have run for their subsystems; they can provide a configuration for kdevops or some other tool that can be used by automated systems.
Ts'o said that he would really like it if developers could run a simple smoke test on their own systems before submitting a patch to him; he has a test appliance that can easily run in a VM on a developer's regular system to find many of the simple problems with a given patch. It takes only about 15 minutes to run that and no cloud resources are needed at all. Damien Le Moal pointed out that people do not know how to go about doing that, so adding a pointer to the information in the MAINTAINERS file would be helpful. Ts'o agreed, noting that there is a difference between the big, expensive, long-running tests that he or Wong might run and the simple smoke tests that individual developers can run on their own patches; that model scales well since there are way more developers than maintainers.
Amir Goldstein said that, since the smoke tests are so simple, it would be easy for Google or someone to run them automatically for developers when they push a commit to a specific branch. He asked Layton about a ctime bug that was found recently; didn't Layton get an email from a testing bot about that problem? Layton said that he did, but that the coverage of the testing bots is lacking; they test the major filesystems, but not NFS or CIFS, for example.
Chamberlain said that he is available to help developers get set up with kdevops and to automate their workflows with it. He has recently added some demo workflows, including one with basic support for Compute Express Link (CXL); it used to be difficult to set up testing for CXL but it should be much easier now. Josef Bacik added PCI-passthrough support, so there is now a kdevops configuration for that.
People
Something else that is needed for these efforts is volunteers to run tests, Chamberlain said. As came up in the XFS-stable-testing session, there are maintainers who want help testing their subsystems for stable kernels and other reasons. It is a good opportunity to learn about the subsystem and the community, in general; it will also provide insights about new features and technology. All of the filesystem maintainers need help with testing, he said, "so if you have any interest, poke at them".
An alternative would be to pool money to hire people for this work. At earlier LSFMM gatherings, it was said that "money is not the problem", but that there needed to be a framework for the testing effort. Some of that work has been done at this point, so does it make sense to try to gather up financial resources to attack the problem? The current financial climate in the industry ("layoffs happened") may preclude that, so until that changes, finding volunteers to do this work is needed.
Layton wondered if it made sense to "hire people to just push buttons"; it would be better to "automate as much of this as we can" to have computers do the work rather than people. Ts'o said that if money were available, he thinks it might be best spent on enhancing KernelCI so that there could be a common dashboard reporting on all of the various testing efforts. The results of fstests could be sent to a central location, along with the test artifacts that would help someone track down the causes of any failures, and the dashboard would allow people to view all of that information.
The ability for others to see the failures along with enough information to look into them is valuable. The dashboard for the syzbot fuzzer has allowed community members to track down various bugs, fix them, and send him patches, so that model can work well, Ts'o said. The idea is to allow others beyond just the person running the tests to fix the problems that are found.
Chamberlain said that he had looked at integrating with KernelCI, but bounced off of the LAVA continuous-integration (CI) system that is used for most of its test labs. Having a public dashboard is the right model, Ts'o said, but that all of the money that went into KernelCI targeted testing Arm boot and devicetree. "Someone needs to throw more money at KernelCI for other subsystems other than devicetree", he said. If someone wanted to look into the LAVA stuff for use in kdevops, it would be helpful, Chamberlain said.
Test changes
Steve French asked about a problem he has encountered: the fstests change over time, so tests that once failed suddenly will start passing (or vice versa). Chamberlain said that kdevops users have encountered that problem as well. There is a need to stabilize the tests, but choosing a particular fstests tag to stick with for a while is the best that can be done right now. He would like to see tags get added for blktests as well.
Ts'o said that a maintainer who wants to be running fstests regularly needs to follow and actively participate in the fstests mailing list in order to keep up with changes and fixes in the tests. For example, a test might be added to ensure that a specific security problem has been fixed; a filesystem maintainer will want that test, so they will not be able to stabilize on a six-month old fstests release. Sometimes tests fail due to, say, an upgrade of Bash or coreutils; the fixes for those kinds of problems will be needed as well.
But Goldstein said that he thinks one of the goals of kdevops is simplicity. It has "expunge" files (lists of tests that are expected to fail) that are extremely specific with regard to which kernel they apply to; if you want to test a different kernel, a new expunge list (or symbolic link to an existing one) is needed. It is not perfect, he said, but it does meet the simplicity goal. Ts'o, who has his own xfstests-bld testing framework, said that he has been running the exclude files (similar to the expunge files for kdevops) through the C preprocessor with #ifdef sections for different kernel versions. He suggested that kdevops might want to do something similar; Goldstein noted that the Fixes: tags could be used by a preprocessor to automatically reflect changes into the expunge/exclude file.
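A preprocessed exclude list of that sort might look something like this illustrative sketch (the test names and version cutoff are arbitrary, and the actual layout used by xfstests-bld may differ):

    /* fed through "cpp -P -include linux/version.h" with the target
     * kernel's version defined */
    #if LINUX_VERSION_CODE < KERNEL_VERSION(6, 1, 0)
    generic/475    # expected to fail on kernels older than 6.1
    #endif
    generic/563    # fails everywhere; excluded unconditionally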
An attendee shifted gears a bit by describing what the BPF community has been doing with its testing. He said that it uses Patchwork as an integral part of the workflow; Patchwork picks up patches from the mailing list and runs tests on the GitHub CI system so that developers can see if their code is causing failures. That system has worked well for the BPF project, he said.
Christian Brauner said that, perhaps surprisingly, Patchwork is set up for the kernel, "it is just unused". He has a to-do item to look into using it because he thinks the patch-series tracking would be useful, separate from any CI integration as was suggested. Ts'o noted that there are other Patchwork instances that are maintained for some subsystems; it would be good to list those instances somewhere in the kernel documentation, perhaps the MAINTAINERS file or the subsystem-description documentation that is being worked on.
French thinks that fstests does not have enough tests; there are around 800 currently, but he thinks it should be more like 2000. He wondered if there was a way to make the test framework so compelling that it would cause bug reporters to also send a test case that could be incorporated into fstests. Can it raise the visibility of the importance of tests in a way that would attract more test development?
Chamberlain said that he thinks the answer is "yes", but that kdevops itself is not the right component for adding tests; it exists to automate running the testing tools that already exist in the community: fstests, blktests, the kernel selftests, and others. Ts'o said that it is difficult to write test cases for fstests, in general, because they need to be "small, self-contained, and easy to reproduce". The bug reports he gets tend to be long-running without failing reliably; once the bug is found, he tries to come up with something on the order of a ten-line shell script to reproduce it. In the rare cases where the bug reporter has a small test to reproduce the problem, it is possible to encourage them to turn that into a test case for fstests. Better documentation on how to write those test cases would help as well.
French asked about testing stable kernels, which is not something he does often; Chamberlain said that it is easy to add a new kernel to kdevops, but the time-consuming part is to get a baseline of the expected test failures in order to create the expunge list for it. Fstests has tons of tests that fail for one reason or another, which is expected; determining why and documenting which should fail takes time. Ts'o said there are ways to use fstests without creating the baseline; when evaluating patches for the stable kernels, he looks for failing tests and then tries them on the earlier kernel. If the test fails at more or less the same rate on the earlier kernel, it gets treated as an expected failure, otherwise the patch itself is the likely culprit.
He noted that if this testing is going to use cloud resources, it is important that it use them efficiently. For example, he looked at the OCI free tier and noted that idle VMs get shut down quickly, which makes sense; VMs should be created just as they are needed and automatically shut down after the test run has completed. He has done some work to ensure that a kernel that hangs because it is spinning in a deadlock gets automatically killed rather than run for hours or days uselessly.
Shin'ichiro Kawasaki, who works on blktests, said that while most of the discussion focused on fstests, it is largely applicable to blktests as well. Unlike fstests, though, blktests are rather small, so he is not sure that a tag for them is all that useful; he suggested using Git commit IDs instead. Chamberlain agreed that could work, but was hoping that the tag could effectively indicate a release that had been tested and "blessed" by the blktests developers. Kawasaki agreed that tagging would help with that, so tags will be applied in the future.
With that, time ran out on the session and, for the most part, on LSFMM+BPF as a whole.
Page editor: Jonathan Corbet