LWN.net Weekly Edition for June 15, 2023
Welcome to the LWN.net Weekly Edition for June 15, 2023
This edition contains the following feature content:
- Yet another memory allocator for executable code: allocating memory to hold kernel-mode code can be surprisingly challenging; the "JIT allocator" is another attempt at solving the problem.
- Addressing priority inversion with proxy execution: traditional approaches to priority inversion don't work with deadline scheduling; proxy execution may offer a solution.
- Deadline servers as a realtime throttling replacement: nobody likes the kernel's realtime throttling hack, but nobody has done anything about it — until now.
- Ongoing LSFMM+BPF coverage, including:
- Two VFS topics: discussion of a sequence number to prevent mount races and of the mount-beneath feature.
- Mounting images inside a user namespace: a solution to enable unprivileged users to mount filesystem images.
- Hardening magic links: files like /proc/[PID]/exe are actually magic links, which can be used to cause security problems, so hardening them is desired.
- Retrieving mount and filesystem information in user space: there is a need for an API to gather information about mounted filesystems; what should it look like?
- Reports from OSPM 2023, part 1: a set of summaries by the presenters at the 2023 Power Management and Scheduling in the Linux Kernel Summit.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Yet another memory allocator for executable code
The kernel is an increasingly dynamic body of code, where new executable text can show up at any time. Currently, the task of allocating memory for new kernel code falls on the subsystem that first brought the ability to load code into a running kernel: the module loader. This patch set from Mike Rapoport looks to move the responsibility for these allocations to a new "JIT allocator", addressing a number of rough edges in the process.

In order to support the ability to load modules at run time, the kernel had to gain the ability to allocate memory to hold those modules. Early on, that was just a matter of calling vmalloc() to obtain the requisite number of pages and enabling execute permission for the resulting pages. Over time, though, things have grown more complicated — as they so often seem to do.
On one side, the number of subsystems that load code into a running kernel has been growing. Tracing, for example, can require adding small bits of code to the kernel. A more frequent user in current kernels is the BPF subsystem, which can see (usually) small executable segments come and go on a regular basis. The proposed bcachefs filesystem has an even more esoteric use case; it generates a special unpack function for each B-tree node, on the fly, for performance. All of these new users tend to stress the memory-management subsystem in different ways, leading to direct-map fragmentation and other performance problems.
To that can be added the proliferation of processor architectures, some of which restrict the address ranges that can be used to hold kernel code. Various architectures have added their own overrides to the module allocator, complicating the code overall. Architecture maintainers are aggressively moving toward a strict regime where executable memory can never be writable at the same time, making it harder for an attacker to load code into the kernel. That, too, complicates the task for subsystems that need to write code into kernel memory.
Rapoport's patch set is intended to simplify life for kernel subsystems that need to allocate memory for executable code. It replaces the existing module_alloc() interface with a pair of new functions:
    void *jit_text_alloc(size_t len);
    void jit_free(void *buf);
A call to jit_text_alloc() will return a len-byte range of executable memory, while jit_free() will return that memory to the system. The memory is initially zero-filled. On systems implementing a strict separation of executable and writable memory, it will not be possible to directly copy loadable code into this allocation; instead, one or both of these functions should be used:
    void jit_update_copy(void *buf, void *new_buf, size_t len);
    void jit_update_set(void *addr, int c, size_t len);
jit_update_copy() will copy executable text from new_buf into buf, which was returned from jit_text_alloc(), while jit_update_set() will set a range of that memory to a constant value.
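To make the flow concrete, here is a kernel-side sketch; install_code() is an invented helper, the jit_*() signatures are those shown above (from the unmerged patch set), and error handling is minimal:

    /*
     * Illustrative only: jit_text_alloc() and jit_update_copy() come
     * from Rapoport's (unmerged) patch set; install_code() is a
     * made-up helper for this example.
     */
    static void *install_code(const void *code, size_t len)
    {
        void *text = jit_text_alloc(len);

        if (!text)
            return NULL;

        /*
         * On architectures enforcing W^X, the allocation cannot be
         * written directly; jit_update_copy() does the copy safely.
         */
        jit_update_copy(text, (void *)code, len);
        return text;
        /* ... later, jit_free(text) returns the memory. */
    }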
On some architectures, data associated with a code region must be allocated near that region; data segments for kernel modules can be subject to this requirement, for example. To ensure proper placement, memory to hold this data can be allocated with:
void *jit_data_alloc(size_t len);
With this set of functions, kernel code can allocate and use space for new executable segments. There is still the matter of architecture-specific constraints, though. These constraints mostly take the form of rules about the placement of executable allocations in the kernel's virtual address space. Rather than have each architecture reimplement jit_text_alloc() to meet its special requirements, Rapoport introduced a new structure to simply describe those requirements to a central allocator:
    struct jit_address_space {
        pgprot_t pgprot;
        unsigned long start;
        unsigned long end;
        unsigned long fallback_start;
        unsigned long fallback_end;
    };
There are two of these structures to be provided by architecture-specific code: one describing the requirements for executable allocations, and one for data allocations. In each, the pgprot field describes the protections that must be applied in the page tables, while start and end delineate the address range in which the allocations should fall. Some architectures implement a second "fallback" range to be used if an allocation attempt from the primary range fails; the location of the fallback range, if any, is stored in fallback_start and fallback_end.
These structures are then bundled into an overall structure controlling how allocations of executable memory (and associated data) are handled on any given architecture:
    struct jit_alloc_params {
        struct jit_address_space text;
        struct jit_address_space data;
        enum jit_alloc_flags flags;
        unsigned int alignment;
    };
The flags field allows for the expression of additional, architecture-specific quirks, while alignment allows the specification of the minimum alignment required for such allocations. A certain amount of digging is required to learn that alignment is interpreted as a power of two; alternatively, one can think of it as the number of least-significant bits that must be zero in a properly aligned address.
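As a purely illustrative example, an architecture might describe its requirements along these lines; the structure fields are from the patch set, but the values shown here are assumptions for the sake of the example:

    /* Hypothetical per-architecture setup; values are illustrative. */
    static struct jit_alloc_params jit_params = {
        .text = {
            .pgprot = PAGE_KERNEL_EXEC,
            .start  = MODULES_VADDR,
            .end    = MODULES_END,
        },
        .data = {
            .pgprot = PAGE_KERNEL,
            .start  = MODULES_VADDR,
            .end    = MODULES_END,
        },
        /* 2^4: allocations aligned to 16-byte boundaries. */
        .alignment = 4,
    };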
With this infrastructure in place, it is possible for the kernel subsystems needing to allocate space for executable text to get the memory they need. Since this allocator is separate from the kernel's module loader, it is no longer necessary to enable loadable-module support just to be able to load other types of code. No real effort has been made to address the performance issues associated with the allocation of executable memory; the idea is that this sort of optimization can be added after the interface has been agreed on.
Comments on this work have fallen into two broad categories. Rick Edgecombe worried that this interface could expose executable code that has not yet reached its intended state. Module code, for example, can be tweaked in a number of ways after it lands in memory. It might be better, he suggested, to prepare the code area first before making it executable.
The other concern, from Mark Rutland, was that, on some architectures at least, the requirements for the placement of executable code vary depending on the type of the code. Loadable modules on arm64, for example, have tighter restrictions than kprobes do. Holding all allocations to the tightest constraints could conceivably cause an address-space shortage in the target area. He suggested creating separate allocators for each memory type, all of which might still use a common infrastructure underneath. Rapoport answered that, if it turns out to be necessary, the central infrastructure could learn to apply different rules to different allocations. It's not entirely clear, though, that the problem is serious enough to need this kind of solution.
Overall, the patch set looks like a reasonable start toward a proper API for the allocation of executable memory in the kernel. There have been several attempts in this area over the last few years, though, and nothing has yet made everybody happy. So we'll have to wait to see what might happen this time around.
Addressing priority inversion with proxy execution
Priority inversion comes about when a low-priority task holds a resource that is needed by a higher-priority task, with the result that the wrong task is the only one that can run. This problem is arguably most acute in realtime settings, but it can happen in just about any system that has multiple tasks running. The variety of scheduling classes provided by the Linux kernel makes handling priority inversion a difficult problem; the latest version of the proxy execution patch series points toward a possible solution.

To understand priority inversion, imagine that a low-priority, background task acquires a mutex. If a realtime task happens to need that same mutex, it will find itself blocked, waiting for the low-priority task to let go of it. Should yet another task, with medium priority, come along, it may prevent the low-priority task from executing at all, meaning that the mutex will not be released and the realtime task will be blocked indefinitely. That is exactly the sort of outcome that the priority mechanism is intended to prevent.
A classic solution to priority inversion is priority inheritance. If a high-priority task finds itself blocked on a resource held by another, it lends its priority to the owning task, allowing that task to complete its work and release the resource. The Linux kernel has supported priority inheritance for a long time, but that is not a complete solution to the problem. Deadline scheduling complicates the situation, in that it is not priority based. Since a task running in the deadline class has no priority, it cannot lend that priority to another task. So priority inheritance will not work with tasks using deadline scheduling.
Kernel developers have been working on this problem for some time; it was discussed at the 2019 and 2020 scheduling and power management (OSPM) conferences, for example. The current patch set, posted by John Stultz but containing the work of a number of developers, shows the current state of this work. At its core, "proxy execution" involves letting a blocked process lend its entire scheduling context to another task holding a needed resource.
To be able to implement proxy execution, the scheduler needs to know exactly which resource a blocked task is waiting for. The task_struct structure already contains a struct mutex pointer called blocked_on that serves exactly this purpose but, in current kernels, it is only compiled in if mutex debugging is enabled. The patch series makes this field unconditional so that this tracking is always performed. The mutex structure already has a pointer to the task that owns it at any given time; the patch series makes that pointer available to the scheduler. The combination of these two pointers allows the scheduler to locate the task holding the resource needed by another task.
The task_struct structure contains a vast amount of information about a running task. The patch series recognizes that this information serves two different roles relevant to scheduling: the execution context and the scheduling context. The execution context contains the information needed to run a given task, while the scheduling context describes how the task will be treated by the CPU scheduler. To enable a logical separation of these two roles, the rq (run queue) structure gains a second task_struct pointer for the scheduling context. Most of the time, the execution and scheduling contexts for a given run-queue entry will be the same, but proxy execution may cause them to differ.
The scheduler's run queues hold tasks that are in a runnable state — they would be on a CPU if one were available for them. When a task blocks to wait for a resource, it is removed from the run queue until it becomes runnable again. One of the more interesting changes made by this patch set is to cause blocked tasks to remain on the run queue, even though they are not, in fact, runnable. That causes the scheduler to pick the first task that it would run, assuming its resources were available, rather than the first task that it can run.
This mechanism may thus leave the scheduler trying to run a task that can't actually run; this is the time for the scheduler to give the CPU to the task holding the resource blocking the execution of the task that the scheduler really wants to run. With the infrastructure described above, implementing this proxy execution is conceptually simple. If the chosen task is not runnable, then follow its blocked_on pointer to find the task it's waiting for, give that task the blocked task's scheduling context (thus boosting its position in the run queue), and run it instead. When the boosted task releases the mutex it is holding, it will lose the other task's scheduling context, and the higher-priority task will be able to continue. Problem solved.
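In greatly simplified form, the selection logic might be sketched as below; mutex_owner() stands in for the scheduler's actual owner lookup (the real mutex owner field has flag bits encoded in it), and all locking, migration, and concurrency concerns are ignored:

    /*
     * Simplified sketch of proxy selection; the real patches must
     * cope with locking, CPU affinity, and chains that change
     * underneath the scheduler.
     */
    static struct task_struct *find_proxy_task(struct task_struct *donor)
    {
        struct task_struct *owner = donor;

        /* Follow the chain of blocked-on relationships. */
        while (owner->blocked_on)
            owner = mutex_owner(owner->blocked_on);

        /* owner runs, but using donor's scheduling context. */
        return owner;
    }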
Naturally, there are a few complications. The task holding the needed mutex may, itself, be blocked on yet another resource, so the scheduler will need to be able to follow a chain of blocked-on relationships. A scheduling context may include a constraint on which CPUs may be used, so a task running as a proxy may need to be migrated to a different CPU first. The scheduler has to keep proxy execution in mind before deciding to migrate a task to another CPU as part of its normal load balancing. CPU-time accounting also becomes more complex; the time used by a task while running as a proxy for another should be charged to the running task, but it is taken from the higher-priority task's time slice to maintain scheduling fairness.
The kernel normally tries hard to spread realtime and deadline tasks across the system's CPUs so that all of them can run, but proxy execution binds the tasks involved onto the same CPU. If one of them is to be migrated to achieve the needed separation, both must be — and here, too, there may be a chain of blocked tasks to worry about. One of the most complex patches in the series attempts to solve this problem. Rather than create "some sort of complex data structure" to track the ability to move tasks, it changes the load-balancing code to simply search through the list of potentially movable tasks. The idea here is that, once the behavior is seen to be correct, optimizations can be applied.
The patch series has not received any review comments as of this writing; all reviewers, it seems, are blocked on other tasks. Given the complexity and long history of this work, though, it seems unlikely that this version will be the last one. Even seemingly simple changes can be hard to apply to the CPU scheduler without creating subtle problems, and this change is not simple.
Deadline servers as a realtime throttling replacement
The CPU scheduler's one job at any given time is to run the task that has the strongest claim to the CPU. There are many factors that complicate that job, not the least of which is that the "strongest claim" is sometimes a bit of a fuzzy concept. Realtime throttling, a mechanism designed to keep a runaway realtime task from monopolizing the CPU, is one case where developers have concluded that the task with, ostensibly, the highest priority should not actually be the one that runs. But realtime throttling has rarely pleased anybody; the deadline-server infrastructure patches posted by Daniel Bristot de Oliveira are the latest attempt to find a better solution.

The POSIX realtime scheduling classes are conceptually simple; at any given time, the task with the highest priority runs to the exclusion of anything else. In the real world, though, the rule enables a runaway realtime task to take over the system to the point that the only way to recover it may be to pull the plug. Power failures, as it turns out, have an even higher priority than realtime tasks.
Yanking out the power cord is aesthetically displeasing to many, though, and tends to cause realtime deadlines to be missed; in an attempt to avoid it, the kernel developers introduced realtime throttling many years ago. In short, realtime throttling restricts realtime tasks to (by default) 95% of the available CPU time; the remaining 5% is left for lower-priority tasks, with the idea that it is enough for an administrator to kill off a runaway task if need be.
Most of the time, this throttling is not a problem. In a properly designed realtime system, the actual realtime work should be using far less than 95% of the available CPU time anyway, so the throttling will never actually happen. But, in cases where a realtime task does need all of the available CPU time for an extended period, realtime throttling can be a problem. This is especially true because the throttling happens even if there are no lower-priority tasks waiting to run. Rather than run the realtime task that still needs CPU, the scheduler will simply force the system idle in this case. The idle time is an unwanted artifact of how the throttling is implemented rather than a desired feature in its own right.
Various efforts have been made to address this problem over the years; this article describes one approach, where realtime throttling would be disabled if it would cause the system to go idle. The deadline-server idea is a different approach to the problem, based on the deadline scheduling class. This class, which has a higher priority than the POSIX realtime classes, is not priority-based; instead, tasks declare the amount of CPU time they need and the time by which they must receive it, and the deadline scheduler works to ensure that those tasks meet their deadlines.
This class thus seems like a natural way to take back 5% of the CPU from realtime tasks when needed. All that is needed is to create a task in the deadline class (called the "deadline server"), declare that it needs 5% of the CPU, and have that task run lower-priority tasks with the time that it is given. The scheduler will then carve out the necessary CPU time but, if the deadline server doesn't need it, it will simply not be runnable and the realtime tasks can continue to run.
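For readers unfamiliar with the deadline class, this is how a user-space task would request an equivalent 5% reservation (50ms of CPU time in every one-second period) with sched_setattr(). The in-kernel deadline server is not created this way, but the bandwidth semantics it relies on are the same:

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/sched.h>   /* SCHED_DEADLINE */

    /* glibc provides no wrapper or structure definition, so define
     * the attribute structure by hand, as the sched_setattr(2) man
     * page does. */
    struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
    };

    /* Reserve 5% of a CPU: 50ms of runtime in every 1s period. */
    int make_five_percent_reservation(void)
    {
        struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_DEADLINE,
            .sched_runtime  = 50ULL * 1000 * 1000,    /* 50ms, in ns */
            .sched_deadline = 1000ULL * 1000 * 1000,  /* 1s */
            .sched_period   = 1000ULL * 1000 * 1000,  /* 1s */
        };

        return syscall(SYS_sched_setattr, 0, &attr, 0);
    }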
The idea, as implemented in Bristot's patch set (which contains patches from Peter Zijlstra and Juri Lelli), does the job reasonably well, in that it makes space for lower-priority tasks without needlessly causing the CPU to go idle. The fact that the deadline class has a higher priority than the realtime classes makes this idea work, but also brings one little problem: once the deadline server is enabled, it will run immediately, perhaps preempting a realtime task that would have eventually yielded anyway. The lower-priority tasks should get their 5%, but giving it to them immediately may create problems for well-behaved realtime tasks.
The proposed solution here is to delay the enabling of the deadline server. A kernel timer is used to occasionally run a watchdog function that looks at the state of the normal-priority tasks on the system. If it appears that those tasks are being starved — with starvation defined as not getting any CPU time over a half-second — then the deadline server will be started. Otherwise, in the absence of starvation problems, scheduling will run as usual.
With this tweak, the work is moving "in the right direction", Bristot said, but there is still room for improvement. The startup of the deadline server can be further delayed to the "zero-laxity" time — the time just before it would miss a 5% deadline entirely. The starvation monitor could perhaps be moved to CPUs that are not running realtime tasks to prevent interference there. In general, though, this work looks like it could be a plausible solution to the realtime-throttling problem.
Two VFS topics
Two different topics concerning the virtual filesystem (VFS) layer were the subject of a session led by VFS co-maintainer Christian Brauner at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. As might be guessed, it was a filesystem-track session; Brauner had three separate items he planned on bringing up, but the discussion on the first two consumed the whole half-hour—and then some. A mechanism to avoid media-change races when mounting loop (or loopback) and other devices was disposed of fairly quickly, but the discussion around the mount-beneath feature went on at length.
Diskseq
There is an issue for container runtimes, Brauner began, because they use a lot of loop devices, but the backing file for a device can change without notification to the runtime. It is a longstanding problem, but Christoph Hellwig came up with the idea of providing a monotonically increasing disk sequence number, which can be used to detect media changes. It would also be useful to detect that a USB storage device has been removed and a new one inserted.
The value of the sequence number would be queried using the BLKGETDISKSEQ ioctl() command, which means that user-space processes could detect when loop (and other) devices have changed their media. There would also be entries in a new /dev/disk/by-diskseq/ directory so that disks can be referenced by their sequence number. This "eliminates a bunch of races, but not all of them", Brauner said.
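Querying the number is a single ioctl() call; this sketch should work on any kernel with diskseq support (5.15 or later):

    #include <linux/fs.h>     /* BLKGETDISKSEQ */
    #include <sys/ioctl.h>
    #include <stdint.h>

    /* Fetch the monotonically increasing disk sequence number for
     * an open block device. */
    int get_diskseq(int block_fd, uint64_t *seq)
    {
        return ioctl(block_fd, BLKGETDISKSEQ, seq);
    }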
[Christian Brauner]
There is a kind of time-of-check-to-time-of-use race that can still occur, which could lead to the wrong media getting mounted. He has pitched an idea to Hellwig to add a "source-diskseq" property to fsconfig(); before the actual mount occurs, the source and the source-diskseq properties could both be verified for the block device, ensuring that only the proper media is actually mounted.
In general, Brauner thinks that these changes are uncontroversial, but he wanted to run them by attendees. One implication of the changes is that block-backed filesystems that have not yet been switched over to the new mount API will need to be. He is fine with doing the work to make that happen. Once it is done, there is still a bit of work that needs to be done on the block layer to support the sequence numbers.
Josef Bacik asked how many filesystems still needed to be converted ("other than one specific filesystem that we know about"). Brauner said that he was not sure, he just assumed that some still exist. Bacik said that he would be doing the conversion for Btrfs "next week, honest", which was met with a good bit of laughter. Brauner said he could probably do it, if needed, but Bacik said he had multiple requests for switching Btrfs, so he would be getting to it soon.
Lennart Poettering suggested that mounted filesystems should also be checking the sequence number to ensure that something surprising has not happened underneath them. Ted Ts'o said that Hellwig has sent patches that would provide a mechanism for the block layer to inform a filesystem that the media has changed, so that the filesystem can simply shut down. It is better if the filesystem is informed about an eject (i.e. media-removed) event, rather than having to check frequently to see if the sequence number has changed.
Mount beneath
The second topic that Brauner wanted to discuss was the mount-beneath (formerly filesystem tucking) operation. It is a way to upgrade or replace a mount by mounting a new filesystem beneath the one being replaced in the mount stack, so that the underlying mount point is not exposed in the process. There is a tricky requirement, however, he said, in that it needs to mesh well with mount propagation.
One use case is for containers with a shared /usr that gets periodically updated. Without the new feature, each update of /usr means that the new one gets mounted atop the existing stack of previous versions; a system with 1000 containers and five updates has 5000 mount entries. Alternatively, unmounting the old /usr first leaves a window in which the underlying mount point is exposed to the services in the container.
The mount-beneath feature was the easiest way that he came up with to avoid those problems when updating filesystems. The kernel walks the mount stack for a given mount point to find the topmost mount and then it inserts the new mount just below the topmost. Then the topmost mount can be unmounted and users will never see any lower mounts or the mount point; they either see the topmost mount or the new mount. It is a way to replace a mount without falling into all of the complexities that would come from actually directly doing a mount-replace operation, he said.
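In new-mount-API terms, the feature takes the form of a new move_mount() flag. Here is a hedged sketch using MOVE_MOUNT_BENEATH, the flag name from Brauner's then-unmerged patches, which could still change:

    #include <linux/mount.h>   /* MOVE_MOUNT_* flags */
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <fcntl.h>         /* AT_FDCWD */

    /*
     * Mount the filesystem behind mnt_fd (e.g. from fsmount())
     * beneath the topmost mount on /usr; MOVE_MOUNT_BENEATH comes
     * from the proposed patches.
     */
    int mount_beneath_usr(int mnt_fd)
    {
        return syscall(SYS_move_mount, mnt_fd, "", AT_FDCWD, "/usr",
                       MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH);
    }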
Ts'o asked how beneath mounts interacted with overlayfs. Brauner said that there should be no problems with that because the overlayfs mount includes all of its constituent filesystems into a single mount. So a new overlayfs can be mounted beneath an existing one without any difficulty.
Brauner demonstrated the feature, which is a bit hard to describe; those interested may want to see it in the YouTube video of the session (right around 14:08). He wanted to show that when mounting multiple times, there is an amplification effect due to mount propagation. If the parent mount and the child mount are in the same peer group, they propagate to each other. Stacking a bunch of identical mounts on top of each other sets up something approaching a combinatorial explosion of mounts due to mount propagation. He is unsure whether these semantics were intended, but wanted to avoid that for beneath mounts.
David Howells asked if it would be easier to have a "swap mount" operation that would switch an existing mount for a new one. Brauner said that is effectively the same as the replace operation he had already mentioned. If a mount that is being replaced has child mounts (on subdirectories), they would need to be moved to the replacement, which may or may not have the right child mount points. Mount beneath neatly sidesteps that by leaving the question of what to do with the child mounts up to user space; before it can unmount the old filesystem, it will need to do something with the child mounts.
Howells was concerned that inserting a mount into the stack, as proposed for mount beneath, would cause problems, but Brauner said that can already happen today. There are ways to insert a mount beneath another. Since that does not cause a problem today, he believes that it will not be one for beneath mounts. He demonstrated some of that around 19:30 in the video as well.
Remote participant Al Viro pointed out that a child mount on, say, /usr/local in the old to-be-updated filesystem could get lost when using mount beneath. Once a new /usr is inserted below, unmounting the old /usr will only succeed if it is a lazy unmount, but then the old /usr/local is no longer accessible. It is inconvenient to have to mount local (and any other mounts on /usr subdirectories) on the new /usr before doing the mount-beneath operation, but that is what has to be done to preserve the hierarchy. Brauner agreed that was the case, but he does not see it as a big problem.
Viro said that the new /usr could be mounted somewhere accessible; each of the subdirectory mounts on the existing /usr could then be bind-mounted to the new one in the right places. After that, the new one could be mounted beneath the old and the old could be lazily unmounted. Brauner thought that all of that should work with the existing mount-beneath feature.
There was some discussion between Viro and Brauner about the propagation problem that was demonstrated. Brauner avoids that in his patches for mount beneath by simply returning an error if this mount-propagation explosion is going to happen. Viro did not seem to be opposed to that approach.
Brauner struggled to describe some of the scenarios that could occur, not because he did not understand them, but because it is difficult to do so in words with limited examples from his computer screen. Viro cautioned that it would be extremely important to fully document the intended behavior, corner cases, and such, because reconstructing them from the code "will be unbearably hard". Brauner said he had a 1600-line file that describes all of the corner cases, just for his own reference; he agreed that comments in the code and documentation will be imperative.
Brauner also poked Howells about his promise to provide documentation for the new mount API system calls. Brauner said that he has been a strong proponent of switching user-space programs to use the new API, has made the switch for a few projects, and that other projects (e.g. systemd) had switched as well; one of the main stumbling blocks is that he has to spend a lot of time explaining how the new system calls work. Viro apologized, though Brauner (and Howells) seemed to think the fault lay elsewhere. With luck, that gentle prod will spur work to finish up the documentation and get it merged.
Mounting images inside a user namespace
There has long been a desire to enable users to mount filesystem images without requiring privileges, but the security implications of allowing it are seriously concerning. Few, if any, kernel filesystems are hardened against maliciously crafted images, after all. Lennart Poettering led a filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit where he presented a possible path forward.
He started with an overview of the problem, noting that "everybody wants to be able to mount disk images that contain arbitrary filesystems" in user space, without needing to be root. Since malicious images could crash the kernel—or worse—the only way to do that is to establish some trust in the image before it gets mounted. He talked about some components that the systemd developers want to add that would allow container managers and other unprivileged user-space programs to accomplish this.
[Lennart Poettering]
More specifically, code that is running in a user namespace can ask the host operating system to mount a filesystem stored in the contents of a particular file. It will require that containers have some limited access to an interprocess communication (IPC) mechanism to talk to the host OS. That is different than today's containers, which generally can only use the kernel API and, perhaps, communicate in a limited way with their container manager, he said.
There are multiple use cases for this feature, including unprivileged container managers that want to run containers from disk images, but also for tools that build container images. There are desktop application runtimes that want to be able to run apps from images as well. Essentially, any tool that wants to be able to work with disk images, but not have special privileges, could benefit.
There are a number of complexities for any solution. Some kind of trust needs to be established in the images before they get mounted; immutable images using dm-verity are easier in that regard, but there is a desire to also have limited support for writable images. Minimizing or eliminating the need for the host to enter the caller's namespace in order to perform the mount is also desirable. Recursion in the form of nested containers should be supported without needing to resort to special cases, as well, he said.
Poettering described how this all might work. An unprivileged process P, which might be a container manager, creates a user namespace U, but does not give U any user/group mappings. It then passes a file descriptor for U through an IPC mechanism to a service on the host, X, which could be a privileged process provided by systemd; X assigns a transient UID/GID range (64K of each, for example) to U. These transient ranges are a "key idea" of the feature; the transient ranges only last as long as the user namespace does and they are recycled when it goes away, unlike persistent UID/GID ranges. It is "dramatically different" to the way these ranges are handled today, he said.
X enforces a security policy on U that only allows a small subset of filesystem operations (open() for create, chmod(), and "a couple of other things") and only on mounts that appear in an allowlist, which is initially empty. So, initially, P cannot create any files. P can talk to Y, which is a different service, via IPC, passing it a file descriptor to U and another descriptor of an image file it would like to mount. It gets back a file descriptor, like one returned from fsmount() (in the new mount API), that corresponds to the mounted image with the ID-mapping from U already applied (using ID-mapped mounts). Y talks to X to get this new mount added to the allowlist and P can attach the mount file descriptor wherever it wants and join U if it has not already done so.
It looks like a lot of steps, he said, but for a client application it is fairly easy. The client simply makes an IPC call to get the user namespace set up and then a second one to get the mount. It can pass multiple images to Y to get multiple mounted filesystems and then it can attach them wherever makes sense in its directory hierarchy.
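Here is a rough sketch of the client side under the assumptions above; the IPC to X and Y is elided, and the helper names are invented for illustration:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/mount.h>   /* MOVE_MOUNT_F_EMPTY_PATH */

    /* Step one: create a user namespace with no ID mappings and get
     * an fd for it that can be passed to the host service (X) over a
     * Unix socket. */
    int make_userns_fd(void)
    {
        if (unshare(CLONE_NEWUSER) < 0)
            return -1;
        return open("/proc/self/ns/user", O_RDONLY | O_CLOEXEC);
    }

    /* Step two: attach a mount fd (as returned, fsmount()-style, by
     * the Y service) wherever the container wants it. */
    int attach_mount(int mnt_fd, const char *where)
    {
        return syscall(SYS_move_mount, mnt_fd, "", AT_FDCWD, where,
                       MOVE_MOUNT_F_EMPTY_PATH);
    }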
Moving past the X and Y placeholders, he got more specific; he had used the placeholders because the concept is entirely generic, so it could be implemented in other ways. For systemd, X would be systemd-userdbd and Y would be a new systemd-mntfsd service. The security policy he described for systemd-userdbd would be implemented using the BPF Linux security module (BPF-LSM). The images to be mounted by systemd-mntfsd would be in the discoverable disk image (DDI) format. More information about DDI (and other surrounding efforts) can be found in the report from last year's Image-Based Linux Summit.
These images have a GPT partition table and are separated into several partitions. One partition is for the filesystem, while another has the dm-verity information. There is a third partition with a signature for the root-level hash of the filesystem, which gets verified by the kernel using its internal keyring. If it passes, systemd-mntfsd will set up the filesystem and dm-verity, apply the user mapping, and return it to the requesting process. DDI makes it convenient to wrap each of those three parts together into a single image.
Another mechanism for trusting images would be to have a trusted directory on the host. Since only privileged processes should be able to write into that directory, systemd-mntfsd could be configured to allow requests to mount images from there. That provides a weaker level of trust but may be fine for some systems, he said.
Those two options (signed DDI and trusted directories) are already implemented and should appear in the next release of systemd. Another mechanism, which would allow mounting writable filesystems, is still being worked on. The idea would be that the requester (perhaps a tool building images) asks for a filesystem of a certain type and size that would be stored in a provided image file, which systemd-mntfsd would create (using mkfs) in the file; it would then add a dm-integrity sidecar file that tracks the changes to the filesystem image. Dm-integrity would use a hash with a key that is not accessible to the caller, so the sidecar file can only be (correctly) updated by the kernel. The caller can provide the image and the sidecar file at a later point and the mount service will be willing to mount it again. If the sidecar file is not present (or is corrupted), the image will not be mounted.
He was asked about using signed fs-verity files as well. He said that it is all being done in user space, so other mechanisms could be added if they make sense. His goal is generally to let the kernel make these trust decisions based on keys on its keyring, rather than "doing trust enforcement in user space", but others may want to do things differently.
Ted Ts'o suggested that systemd-mntfsd could copy an image file to a block device that is inaccessible to the requester, then run fsck on the filesystem image. If it passes that check, it could be mounted in a suitable fashion (e.g. nosuid, nodev) and handed off to the container without needing to use dm-verity. Poettering said that fsck is already being used in the writable case, "but it was news to me that this is the philosophy that filesystem engineers subscribe to". He noted that other filesystem developers were "shaking their heads", so he did not think that there was universal agreement that fsck was sufficient to detect malicious images.
Ts'o said that it would depend on the filesystem, so Poettering tried to get a commitment about ext4, but Ts'o hedged things a bit. He is "reasonably confident" that it is not possible to cause a "buffer overrun or privilege-escalation attack" with an ext4 filesystem that passes fsck. Denial-of-service due to an overly fragmented filesystem would be a possibility, though, so it "depends on what your threat model is", he said. Josef Bacik said that he just comes from a standpoint of being paranoid. He trusts that the Btrfs fsck does a good job to ensure that there is a valid filesystem, but it, like him, is imperfect. It sounds like a good solution, but he would be leery of trusting it in a high-security situation.
Jeff Layton asked about network filesystems. Poettering thought that might be less worrisome, but Layton assured him that it would not be. There is interest in being able to pass a directory file descriptor to systemd-mntfsd, which will bind-mount to that directory, apply the UID mapping, and return that to the requester, Poettering said. That is not particularly risky because the filesystem is already mounted in the system, which is perhaps analogous to the network-filesystem case. But it turns out that none of the network filesystems implement ID mapping, though Christian Brauner said that he had gotten it working for CephFS (with some caveats).
Layton said that a malicious server was just as bad or worse than a malicious image, but that NFS had recently added TLS support. One way to establish trust in that environment would be to only allow servers that can present a properly signed TLS certificate. David Howells raised the automounter as another thing to consider, while Steve French mentioned SMB. Poettering said that if there is a need to mount these kinds of things in containers, they can be added, "as long as there's some kind of sensible security story in place".
There is an unresolved problem that has cropped up, he said. LSMs cannot restrict manipulations of access-control lists (ACLs), so it is a way that the transient IDs in the user namespace (U above) could leak out into the rest of the system in a persistent fashion. Perhaps it is not a big problem, he said, but all of the other ways that these IDs can be persistently associated with filesystem objects (e.g. chown()) are being blocked. He is not too concerned, but it is a low-severity vulnerability.
He gave a demo at around 19:10 in the video of the session. He started systemd-userdbd in one window, systemd-mntfsd in another, and then handed a disk image to systemd-dissect, which mounted it using the new mechanism and then pulled it all apart. He ran it as an unprivileged user "and it just works". The user IDs are handled correctly and it is all "extremely simple". Furthermore, it is something of a showcase of recent kernel features, such as the new mount API (across namespaces) and BPF-LSM; they and a few others can be combined to provide this long-sought feature.
He is pleased with the result, because "it is tiny", is socket-activated so it is not running all of the time, and there is just a single socket for IPC that needs to be bind-mounted into the container to make it all work. Brauner pointed out that the superblock is not owned by the user namespace where the mount is being done, "which means that all of the destructive ioctl()s" that exist for Btrfs or XFS are not available to the container. But the container does own the mount, which means it can unmount it. The ownership of the mount is separate from the ownership of the superblock, he said, which is a nice side effect.
An attendee asked whether the containers would have access to the image files after the mount had been done. If so, a container could modify the image and thus potentially crash or compromise the kernel that way. Poettering said that the containers may have access to those files, since they might own them, but that dm-verity is meant to prevent any changes; if the image file is changed, any read of that region will return an error. Other mechanisms, such as fs-verity and dm-integrity, would also provide that kind of protection. He noted that in the fsck scenario, Ts'o had said that the image would need to be copied to a location inaccessible to the container.
The session ended with a quick discussion of how a network filesystem might be mounted in a separate network namespace for the container. Poettering said that it was something to work out with the network-filesystem developers, since it would need to be a mount option of some sort. Howells said that it would be straightforward to do that using the new mount API if it were deemed desirable.
Hardening magic links
There are some "magic links" in kernel pseudo-filesystems, like procfs, that can be—have been—(ab)used to cause security problems, such as a container-confinement breach in 2019. Aleksa Sarai has long been working on ways to blunt the impact of these magic links. He led a filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit to discuss the status of those efforts.
Sarai said that he worked on hardening for these links as part of adding the openat2() system call, but he removed some of that work before it was merged because the semantics were unclear. So, he wanted to have a discussion on those pieces to try to ensure that they make sense to everyone, that attendees are happy with them, and to avoid "having things thrown at me when I post them to the list".
At this point, openat2() has been merged and he is still working on libpathrs, which is a path-resolution library that allows those operations to be done safely in containers. The main thing he wanted to discuss was a draft patch for magic-link hardening, which lives on a branch in his GitHub repository.
A magic link looks like a symbolic link but is not one; magic links are described as follows in the symlink man page (under "Magic links"):
Unlike normal symbolic links, magic links are not resolved through pathname-expansion, but instead act as direct references to the kernel's own representation of a file handle. As such, these magic links allow users to access files which cannot be referenced with normal paths (such as unlinked files still referenced by a running program).
Classic examples of magic links are /proc/[PID]/exe and /proc/[PID]/fd/*. They allow processes in containers to potentially see kernel objects that should not be accessible to them, for example. openat2() allows callers to disallow following these links with the RESOLVE_NO_MAGICLINKS flag, which can aid non-malicious programs, but the hardening he wants to add would go beyond that.
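The flag is available in mainline kernels (since openat2() was merged in 5.6); glibc offers no wrapper, so the raw system call is used:

    #define _GNU_SOURCE
    #include <linux/openat2.h>   /* struct open_how, RESOLVE_* */
    #include <sys/syscall.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Open a path, refusing to traverse any magic links on the way. */
    int open_no_magiclinks(const char *path)
    {
        struct open_how how = {
            .flags   = O_RDONLY,
            .resolve = RESOLVE_NO_MAGICLINKS,
        };

        return syscall(SYS_openat2, AT_FDCWD, path, &how, sizeof(how));
    }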
[Aleksa Sarai]
As with the vulnerability in 2019, a container process could get a reference to its container-runtime binary on the host by way of /proc/[PID]/exe. It will not be able to write to that file while the runtime is running, but it can wait until the runtime isn't and then do so. He noted that people may be wondering why a container process has the rights to open a file for writing on the host, but that is, perhaps sadly, a requirement for today's container runtimes (such as Docker and Kubernetes), which run as root without any user namespaces, he said.
Today's container runtimes "do a variety of awful things" to stop this attack. In particular, right now they all copy the binary to an anonymous file created with memfd_create() every time a container is created; the memfd is then sealed. The end result is that "even if you can overwrite the damn thing, it won't affect other containers in the system". He thinks that everyone agrees that "this is all absolutely awful and should not exist", but it is unfortunately needed. He wants to solve the problem in the kernel and he believes that a general ability to restrict file reopening would also be useful, so that is part of his patch set as well.
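In outline, that copy-and-seal workaround looks like the following sketch; the helper name is invented, error handling is minimal, and the cross-filesystem copy_file_range() call needs a 5.3 or later kernel:

    #define _GNU_SOURCE
    #include <sys/mman.h>   /* memfd_create() */
    #include <fcntl.h>      /* F_ADD_SEALS, F_SEAL_* */
    #include <unistd.h>

    /* Copy the runtime binary into an anonymous memfd, then seal it
     * so that nothing, container included, can ever modify it. */
    int sealed_runtime_copy(int binary_fd, off_t size)
    {
        int mfd = memfd_create("runtime", MFD_CLOEXEC | MFD_ALLOW_SEALING);

        if (mfd < 0)
            return -1;
        copy_file_range(binary_fd, NULL, mfd, NULL, size, 0);
        fcntl(mfd, F_ADD_SEALS,
              F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE);
        return mfd;
    }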
The core of the patch set is that it will only allow reopens of magic links if the mode being requested is a subset of the mode set on the magic link file handle in the kernel. It would also add an O_EMPTYPATH flag to openat() (and openat2()) that allows the passed-in directory file descriptor to be used as the file descriptor of a file to be reopened. It would provide a mask mechanism to restrict reopen modes that can be specified at the time a file is opened with openat2(). Lastly, it would expose the reopening restrictions for files in /proc/[PID]/fdinfo/*.
He gave some further details of what it means for the requested mode to be a subset of the existing one. The O_PATH flag to open() and friends simply requests a descriptor referring to the path of the file—it does not actually open the file itself. Assuming that no mask has been placed on a file, an O_PATH reopen of a regular file will allow any legal mode to be used; this is how things work today and that would not change. But for a magic link, which has its own "magic modes" that are different from those for regular files, an O_PATH reopen will copy the mode of the existing open file. Other kinds of opens (or reopens), like O_RDWR for read and write, will be handled in the usual way. All of the modes for reopens are based on the f_mode field in the kernel's struct file.
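The "reopening" at issue is the long-standing /proc idiom shown below; under the proposed hardening, the mask on the file descriptor would bound what such a reopen can request:

    #include <fcntl.h>
    #include <stdio.h>

    /* Reopen an existing descriptor with new flags via /proc.  Under
     * the proposed hardening, the result would be limited by the
     * descriptor's reopen mask. */
    int reopen_fd(int fd, int flags)
    {
        char path[64];

        snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
        return open(path, flags);
    }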
He wanted to know if those restrictions made sense. He believes they do, though there are some corner cases that need to be considered, but it does protect against the problem he is trying to solve. He also wanted to consider future-proofing the design, which might mean figuring out how directories fit into it as well.
David Howells asked if it made sense to add a separate system call just for reopen operations, but Sarai said that would not help. Lots of code is using the existing system calls to reopen files and those are not going away. By and large, reopen is not being used nefariously; in fact, container runtimes "don't just use this, we abuse it to hell and back", because they must. There are "certain security properties you cannot get without using it", he said.
Amir Goldstein asked if chmod() could be used to change the discretionary access control (DAC) permissions on the files, instead. But Al Viro pointed out, via the remote-access audio, that mode bits for symbolic links are completely ignored. Goldstein wondered if they could be made to matter for magic symlinks, then chmod() could be used to control access to the links. Viro did not think that was feasible. He pointed out that anytime /dev/stdin is opened, it actually resolves to /proc/self/fd/0, so the behavior of the magic links cannot be changed without "breaking the living hell" out of lots of different things.
Christian Brauner agreed that backward-compatibility is important. There are O_PATH opens in lots of other places at this point, for example in the pseudo-terminal (pty) handling. People regularly propose "fixes" for the /proc/self/exe problem because the current solution is not pleasant, so he thinks it makes sense to use Sarai's mechanism, make it work well, and head off further hacky fixes.
Viro asked what would happen if someone were to bind-mount to the location where /proc/self/exe points and then reopen via that path for write. Sarai agreed that was a problem, and one that is worth addressing, but as a practical matter for containers, it is not a problem because nearly all containers cannot do the bind-mount in question. Sarai noted that the /proc/self/exe attack is a problem because 99.99...% of containers are running as root and do not employ user namespaces. Brauner said that user namespaces are not a panacea, but they do block the problem with containers overwriting the runtime binary.
Sarai went through some of the problems with handling directories at a rapid pace, then shifted into restricting the execution of files. Right now, there is no way to restrict a file handle such that it cannot be used to execute the file contents using fexecve(); the DAC permissions can be used to restrict it, but once the file is open, the file descriptor cannot be passed to an untrusted process with execution blocked. The same goes for directories; you cannot restrict path resolution from an open directory file descriptor. Even if those things do not get implemented, the design of the restrictions that he is implementing should take those potential use cases into account.
Viro said that, currently, the write-permission bits on a directory do not affect whether files in that directory can be written and wondered if Sarai was suggesting changing the meaning of the directory permission bits in some fashion. Sarai said that he was not; if these changes were implemented, an O_PATH open of the directory could set its mask such that writing is not allowed, so another process would not be able to create a directory or regular file there using that O_PATH descriptor. Howells likened it to an access-control list (ACL) governing what could be done using the O_PATH descriptor.
Viro expressed skepticism about changing the behavior for directories in that fashion, but Sarai pointed out that it was effectively opt-in; those who want to do this would need to set the mode mask on the O_PATH file descriptor before passing it onward. Viro asked about bind-mounts and Sarai once again agreed that they are a problem, though the vast majority of containers are run in a mount namespace so that they are unable to create the mount in question. Which is not to say that he does not believe the problem should be solved, however.
Another question that Sarai had was about mounting on top of symbolic links, which works today; there is no way to restrict mounting on top of magic links, which is even messier. But Viro said the kernel should restrict mounts from locations in /proc/[PID]/* using the "no mounts here" inode flag. "I cannot tell you how happy I am to hear that", Sarai said, claiming he would write the patch as soon as he left the room. There was a bit more discussion of that as the session ran out of time, but it would seem to resolve many of the concerns Sarai had about mounting on magic links.
Retrieving mount and filesystem information in user space
In something of a follow-on from the mount-operation monitoring session the previous day, Christian Brauner led another discussion about providing user space with a mechanism to get current mount information on day two of the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. The session also continued on from one at last year's summit—and likely others before that. There are two separate proposals for ways to retrieve this kind of information, one from Miklos Szeredi and another from David Howells, both of whom were present this year; Brauner's intent was to try to reach some kind of agreement on the way forward in the session.
Background
Brauner began by noting that user-space developers, Lennart Poettering in the back of the room in particular, have been asking him for a way to query mount and filesystem information. He said that Howells has proposed the fsinfo() system call and that Szeredi has worked on getvalues(), which has shifted to use the extended attributes (xattr) interface, as was discussed last year. There were other proposals in the mix, Brauner said; "we somehow need to come to an acceptable compromise" so that things can move forward.
People have different preferences with regard to the facility; his main one is that the information not be exposed as a filesystem, because "it is a giant pain for user space, usually". It might be good to hear Poettering's thoughts on what kind of API he would prefer, since systemd will be a user of it.
Szeredi noted that Linus Torvalds had suggested having a stat()-like system call that reported information from /proc/self/mountinfo. It could return a mixture of binary and textual information, which is seen by some as problematic; the simplest thing would be to simply return the text line from the mountinfo file for a requested mount. Beyond that, though, is a need to be able to list the child mounts (i.e. mounts on subdirectories) for a given mount. A new system call could simply return a list of extended mount IDs (of the 64-bit variety mentioned in the earlier session) for the children. Those are simply some of his ideas, Szeredi said, perhaps others have different ideas.
After Torvalds weighed in on the fsinfo() proposal, several people contacted Howells to ask that the interface not be textual because parsing text can be painful and is slow. Brauner said that he would prefer a non-textual interface as well. Jeff Layton asked for a recap of the objections to fsinfo(); Brauner said that he thought the main problem was that mount notification and the general query mechanism were combined in a single, huge patch set. It was a difficult patch set to review, which is part of what Torvalds was reacting to. Brauner thinks that splitting some of the functionality up over multiple system calls is not a problem; the days where it was truly difficult to add a new system call are over.
Szeredi had no major objection to a binary interface, but there are some pieces of mount and superblock information that are hard to represent in binary. There is a textual format that is already established; even if it is difficult to parse, there are already parsers available for the format. The performance objections do not stem from parsing a single entry in mountinfo but from having to parse the entire file, he said.
Poettering said that, from his perspective, the problem with textual interfaces is "that splitting out the fields is always nasty [...] because of escaping and figuring out what the delimiters are". The kernel is not good at having uniform logic for those kinds of interfaces either. If the fields are separated in the structure returned from some kind of query, a textual interface would be fine with him.
He would like to see a single, atomic call to return information, as with statx(), rather than have to make several calls to retrieve what is needed. Brauner asked if that single call needed to also return the child mounts or if it would be sufficient for it to provide the mount ID that could be used in a separate call to retrieve the child mounts. That would be fine with Poettering; he would like information about a kernel object to be returned in an atomic fashion, but "I have no illusion that getting an atomic view of more than an object" is sensible. Brauner said that the mount table can be constantly changing, so getting a snapshot of it is inherently racy; Poettering agreed and said that was not what he was looking for.
Extensible interface
Brauner proposed a starting point for the API to consist of fsinfo() (or some other name, "we can squabble about this") that had a structure with a core set of information that is useful for user space. That structure could be extensible, "we know how to do this, we've done this before", even though some people do not like extensible structures that are versioned by size. It would take a mask of what information was being requested and it would return a mask with which of those were available. There can be both textual and binary information in that structure.
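Nothing like this has been merged; purely as a strawman, the sort of size-versioned, mask-driven structure being described might look like the following (every name here is hypothetical):

    /* Hypothetical strawman only: none of these names are real. */
    struct mountinfo {
        __u64 size;        /* structure size, for extensibility */
        __u64 request;     /* mask of fields the caller wants */
        __u64 available;   /* mask of fields the kernel filled in */
        __u64 mnt_id;      /* 64-bit mount ID */
        __u64 parent_id;
        __u32 mnt_attrs;   /* read-only, nosuid, ... */
        char  fs_type[32];
        /* further fields appended over time, statx()-style */
    };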
Amir Goldstein said that it is important to be able to query for filesystem-specific information, which Szeredi's xattr-based API could support. In fact, it would have been nice to have a way to do that for statx(), as well, since some filesystems do have inode-specific information that is exported. There are virtual xattrs that some filesystems support for that, he said. Poettering and others seemed to think that was a reasonable approach, though there may be some wrinkles to iron out with it.
Steve French said that CIFS makes various types of filesystem-specific information available in /proc, though it is not clear to him how user space could go from a file descriptor to get to those entries. Ted Ts'o said that he is leery of mixing filesystem-specific information queries in with the more general query mechanism, in part because a simpler proposal will be easier to review. It is clear what the use case for the mounted filesystem information is, Ts'o continued, but much less so for all of these esoteric, filesystem-specific bits; combining the two may add more complexity for little gain. Querying for that extra information can be addressed separately. The virtual xattr approach is contentious, with some, including Christoph Hellwig, finding it to be a "radical abuse of that interface"; even if that is not a reasonable position, he would rather avoid that particular battle.
Josef Bacik noted that the mount options are the only filesystem-specific information that would be returned from the generalized query; he wondered if the mount options, beyond attributes like read-only, were needed by user space. Poettering said that systemd is interested in the universally unique ID (UUID) for the superblock, but Brauner cautioned that exposing the UUID is more complicated than it might seem. Some filesystems generate a UUID, but others do not; some filesystems use the UUID to generate a filesystem ID (FSID), but not always. For example, XFS generates the FSID from the block-device information. So exposing that information requires additional work, but if the query mechanism is extensible, that can all come later.
Brauner suggested that the filesystem-specific question could be set aside for now. A core structure could be defined that is generic for all filesystems, then another text-based system call for filesystem-specific options (e.g. mount options) could be added.
Poettering would prefer to get all of the mount options together in a single call, rather than one-by-one in multiple calls. Howells said that he wanted something like that too for a "mount supervisor" that he wants to create. The supervisor would intercept mount requests in a mount namespace and allow or deny them based on the mount options; it could also be used for NFS automounts, he said.
Poettering noted that the util-linux utilities use the project's libmount library, which gathers up the mount options so that various tools can report them; if the idea is to support those use cases, the mount options are going to need to be available. Bacik said that it seems perfectly reasonable to him to just provide the mount options in the form of a string (or a list of single-option strings); that code is already present for the mountinfo file. Ts'o agreed that providing filesystem-specific mount options that way made sense. If there is a need for something with more structure for mount options, it can be added later.
xattrs?
Szeredi said that there is already a system call that can be used to get filesystem-specific options: getxattr(). But Ts'o said: "I will let you fight with Christoph about using getxattr() for something that is not a real extended attribute". Ts'o does not think it is a good generic approach, though individual filesystems can do whatever they want. There are other problems with using getxattr(), Howells said, including mounts that are not reachable via a path or a file descriptor.
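For illustration, the xattr-based approach amounts to something like the following, where the "fsinfo.mount.options" attribute name is purely invented for this example, since no such namespace was agreed on:

    #include <stdio.h>
    #include <sys/xattr.h>

    int main(int argc, char **argv)
    {
            char buf[4096];
            ssize_t len;

            if (argc < 2)
                    return 1;
            /* Ask the filesystem for a (hypothetical) virtual xattr
             * holding its mount options. */
            len = getxattr(argv[1], "fsinfo.mount.options",
                           buf, sizeof(buf) - 1);
            if (len < 0) {
                    perror("getxattr");
                    return 1;
            }
            buf[len] = '\0';
            printf("%s\n", buf);
            return 0;
    }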
From afar, Al Viro asked what a mount supervisor is going to be able to do with a mount option that refers to a file descriptor by number. Poettering added that the options listed in mountinfo for automounted filesystems have things like "fd=5", which is not meaningful to other processes. Howells said that he had added the 64-bit mount IDs that could be used to identify mounts; those could be queried using fsinfo() or some other interface.
Poettering also noted that xattrs are normally a property of an inode, so making them suddenly return information about the filesystem is a bit weird and will be hard to explain; that would be another reason to find a different style of interface, he thought. Another remote participant, Darrick Wong, wondered if fspick() could choose a filesystem based on an FSID and if the file descriptor it returned could be used to get filesystem-specific virtual xattrs, which might actually route around Hellwig's objections. Wong guessed that Hellwig did not like mixing the regular and virtual xattrs because you could not tell the difference if someone had simply added a regular xattr with the same name as a virtual one.
Howells said that fspick() takes a path or file descriptor, so you cannot use an FSID. Layton said that CephFS had its own xattr namespace, so virtual xattrs could be distinguished from the regular ones. But Brauner does not think the xattr API is a good one, so it is not one that should be used for this purpose. "This is a really broken API in my opinion." It is type-unsafe and convoluted; access-control lists (ACLs) were moved out of xattrs, so he hopes that filesystem attributes can be moved out as well.
Brauner suggested converging on a slimmed-down version of fsinfo(), under that or some other name, and adding a separate system call to retrieve the mount options in textual form. That should provide util-linux and systemd with what they need. Layton suggested adding the UUID to fsinfo() even if not all filesystems support it (yet); if the request/response mask is used, those filesystems can simply not report it.
Brauner said that a goal of getting that into the kernel by the end of the year, or early in 2024, seemed reasonable. It is mostly a matter of copying the code for statx() and hacking it up to be suitable for generic filesystem information. Goldstein added, "and make sure it's extensible", which to Brauner sounded like Goldstein was volunteering to do the work. Rapid backpedaling to general laughter was the result.
Szeredi wondered about getting child-mount information, but Brauner thought there had been agreement on a new system call for that. There was some discussion about ways to shoehorn that information into fsinfo(), but Brauner and others are resistant to the idea of variable-length arrays embedded into structures. Eric Biggers asked that any new system calls that get added for this have both documentation and tests. The session wound down shortly thereafter, but not before Brauner, with a big grin, said "hopefully we can all remember the good spirit" of the session on the mailing list when patches start getting posted.
Reports from OSPM 2023, part 1
The fifth conference on Power Management and Scheduling in the Linux Kernel (abbreviated "OSPM") was held on April 17 to 19 in Ancona, Italy. LWN was not there, unfortunately, but the attendees of the event have gotten together to write up summaries of the discussions that took place and LWN has the privilege of being able to publish them. Reports from the first day of the event appear below.
Reports from day 2 are also available.
Improving system pressure on CPU-capacity feedback to the scheduler
Author: Vincent Guittot (video)
The factors that can impact the compute capacity of CPUs can be split into two types: the first impacts the performance of the CPU by acting on the frequency, whereas the second reduces the number of cycles available for tasks. The session was about the first type of pressure, for which we can distinguish four kinds of impacts: hardware mitigation, which can act at kHz frequencies; firmware/kernel mitigation, which can change with a period of 10 to 100ms; capping of the power budget provided to the CPUs; and action by user space, which happens at an even lower frequency, with a multi-second period.
The current implementation of thermal pressure in the kernel uses one input that is then used for PELT filtering or to estimate the actual maximum performance. Although we have only two users of this input (the Qualcomm limits management hardware and the CPU-frequency cooling driver), other parts also cap or boost the maximum frequency. In the end we have one interface for three different behaviors: instantaneous, average, and permanent pressure ("permanent" meaning that it applies for one second or more, which can be seen as permanent from the scheduler's point of view).
The high-frequency pressure should keep the current interface. The medium pace needs a new interface able to handle several inputs submitted by firmware and/or the kernel. The user-space pressure, such as changing the power mode of a laptop when plugged into the wall power supply, should trigger the update of the original CPU capacity in order to align it with the new maximum performance.
Keeping a gap indefinitely between the actual maximum capacity of a CPU and the value stored by the scheduler can impact the behavior of load balancing and task placement. It was asked whether such behavior could be related to the KVM patches that have been submitted to scale the frequency of the host CPUs with guest performance requests; this seems to be a different problem, as those patches want to apply the guest's request to vCPU threads.
While studying the current thermal-pressure implementation, I found strange behaviors in the energy model (EM) that need consolidation. The implementation uses the current maximum frequency and CPU-capacity pair to compute the energy cost, and they can change at run time (although not that often). The energy model only needs to save a coherent pair of frequency and capacity values that were used when building the model, but they don't have to be the current maximum ones.
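A sketch of the point being made (an illustration of the idea, not the kernel's actual energy-model code): per-OPP costs can be derived once, from a (frequency, power) table that was coherent when the model was built, with the reference frequency frozen at build time, so a later change in the CPU's maximum capacity does not require rebuilding the table.

    struct opp {
            unsigned long freq;   /* kHz */
            unsigned long power;  /* µW */
            unsigned long cost;   /* precomputed against ref_freq */
    };

    static void em_build_costs(struct opp *table, int n)
    {
            /* Highest OPP at the time the model is built; this value
             * stays frozen even if the maximum capacity later changes. */
            unsigned long ref_freq = table[n - 1].freq;

            for (int i = 0; i < n; i++)
                    table[i].cost = table[i].power * ref_freq / table[i].freq;
    }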
The maximum compute capacity can then be changed without impacting or rebuilding the energy model. During tests on an Arm system with a boost frequency, it was noticed that there can be a mismatch between the energy model and the CPU-frequency governor. It was suggested that energy-aware scheduling (EAS) should probably be disabled when boost frequencies are selected; since the kernel reverts to a performance-oriented mode when CPU utilization reaches 80%, that situation should not normally arise, but the tests showed that it can, and others suggested that the CPU might not yet be overutilized when tasks wake up.
Then the discussion moved on to why the energy model is not used with Intel processors. Using it there has already been suggested, and there seems to be no real blocker other than the lack of fixed operating performance points (OPPs), which has probably kept developers from trying to use the energy model. There are patches to support an artificial energy model, which should address that concern.
The discussion moved back to the system-pressure proposal, which can create a situation where no CPU in the system appears to have 100% capacity. Although that should not be a major problem, the detailed impact of this new situation must be carefully studied. The new interface will replace the current thermal-pressure mechanism in the scheduler, including in load balancing and EAS. It was asked whether the compute capacity of each OPP could be saved, since capacity can be non-linear with frequency, which violates the current assumption. The discussion ended with some suggestions about disabling EAS and the energy model at boot until everything has initialized, as is done for RCU, but the system doesn't really know when the CPU-frequency driver will be loaded and the system will be ready.
Utilization boosting
Author: Dietmar Eggemann (video)
Per-entity load tracking (PELT) is the Linux kernel's task-size tracker, where "size" can be measured by load (based on runnable — waiting and running — time and weight), utilization (based on running time), or runnable (based on runnable time). Utilization drives CPU-frequency selection and energy-aware scheduling (EAS) run-queue selection. The responsiveness of PELT is still considered too low to react quickly to utilization changes during task ramp-up. With util_est (utilization estimation), which caches a task's peak utilization before it sleeps for use at wakeup, and Uclamp (utilization clamping), which allows per-entity utilization tuning from user space, there are already mainline kernel features available to improve the situation.
During the talk, additional ideas to improve utilization ramp-up, which had appeared on the linux-kernel mailing list over the previous eight months, were discussed. It started with the proposal to make the PELT half-life changeable at run time. Some Android systems use this to run gaming workloads with a smaller PELT half-life of 16ms or 8ms instead of the default 32ms; they claim better frames-per-second (FPS) rates, with only a moderate energy increase, due to the faster rise and decay of the PELT signal.
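To make the half-life concrete, here is a rough user-space analogue of PELT's geometric decay (an exponentially weighted moving average sketch, not the kernel's fixed-point implementation; the kernel sums per-period contributions, roughly 1ms each, with a decay factor y such that y^32 = 0.5 for the default 32ms half-life):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
            /* Decay factor y chosen so that y^half_life == 0.5. */
            double half_life = 32.0;   /* periods of ~1ms; try 8 or 16 */
            double y = pow(0.5, 1.0 / half_life);
            double util = 0.0;

            /* A task running 100% of the time: the tracked utilization
             * climbs toward 1.0, reaching 0.5 after one half-life. */
            for (int period = 1; period <= 100; period++) {
                    util = util * y + (1.0 - y);   /* ran the whole period */
                    if (period == 32 || period == 100)
                            printf("after %3d periods: %.3f\n", period, util);
            }
            return 0;
    }

A smaller half-life makes the signal rise (and decay) faster, which is exactly the responsiveness-versus-stability trade-off being tuned.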
The design was rejected because it isn't clear which specific problem it solves. Two mechanisms that improve the responsiveness of a single completely fair scheduler (CFS) task without changing the PELT half-life for the whole system, util_est_boost and util_est_faster, were reviewed. The idea of using runnable alongside utilization came out of the util_est_faster discussion on the mailing list: a situation in which runnable is larger than utilization indicates run-queue contention, so using the greater of the two values helps boost CPU frequency in those scenarios. On a Pixel 6 device, the JankbenchX UI benchmark showed that this can significantly reduce jank frames (frames that don't meet the required rendering time).
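The core of that heuristic is tiny; a sketch (with illustrative names, not the kernel's) might look like:

    /* If runnable exceeds utilization, tasks are piling up on the run
     * queue; feed the larger of the two signals to frequency selection. */
    static unsigned long cpu_util_for_freq(unsigned long util_avg,
                                           unsigned long runnable_avg)
    {
            return util_avg >= runnable_avg ? util_avg : runnable_avg;
    }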
With API level 31, Android started to provide the Android Dynamic Performance Framework (ADPF) CPU-performance hints feature which allows per-task boosting using Uclamp.
As a stand-alone mainline kernel feature, boosting the CPU-utilization signal with runnable for faster CPU-frequency ramp-up can work alongside Android's CPU performance-hinting user-space boosting mechanism. It is unlikely that a runtime PELT half-life modifier will get into the mainline kernel, even though Android would like to have it as a tuning parameter for CPU performance and power management.
Results from using SCHED_DEADLINE for energy-aware optimization of RT DAGs on heterogeneous hardware
Author: Tommaso Cucinotta (video)
In this talk, some results from the AMPERE EU project were presented, including prototype modifications our research group made to SCHED_DEADLINE, to accommodate the requirements of application use cases belonging to the automotive and railway domains. We considered in the project some different reference platforms, including a Xilinx UltraScale+ board with FPGA acceleration, and an NVIDIA Xavier AGX board with GPU acceleration. Throughout the three-year project, running from 2020 to date, we implemented a number of components, most of which were released under an open-source license.
We implemented PARTSim, a power-aware and thermal-aware realtime-systems simulator for big.LITTLE platforms, in which we modeled the variability in execution time and power consumption of a range of boards (ODROID-XU4, Raspberry Pi 4, Xilinx US+ ZCU102), so that we can simulate the temporal behavior of applications on these boards under a variety of scheduler configurations. We also prototyped, in this simulator, big.LITTLE CBS, a variant of a SCHED_DEADLINE-like realtime scheduler that, on every task wakeup, is able to schedule the task on the core (among the one it woke up on, a core from the same island, or a core from another island) that is most convenient from a power-consumption perspective.
We prototyped APEDF, a variant of SCHED_DEADLINE providing adaptive, partitioned, earliest-deadline-first (EDF) scheduling of realtime tasks that requires no user-space API changes. APEDF can automatically partition SCHED_DEADLINE tasks among the CPUs using simple heuristics, consistently handling the EDF schedulability condition for all cores, thus guaranteeing that no deadlines are missed when proper admission conditions are met. More importantly, APEDF can provide guarantees in the presence of DVFS, providing a sound and consistent foundation for the power-management logic in schedutil in the Linux kernel, which unfortunately is not very effective in the presence of tasks using the default global-EDF behavior of SCHED_DEADLINE. See the separate talk about "SCHED_DEADLINE meets DVFS" for more details and experimental results.
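As a rough sketch of the per-core admission logic implied by "handling the EDF schedulability condition" (illustrative only; SCHED_DEADLINE's real admission control is more involved), a task set fits on a core under partitioned EDF with implicit deadlines as long as total utilization stays at or below one:

    struct dl_task {
            unsigned long long runtime_ns;
            unsigned long long period_ns;
    };

    /* Does a new task fit on a core already running n tasks?  Under
     * per-core EDF with implicit deadlines, the utilization bound is 1. */
    static int core_admits(const struct dl_task *tasks, int n,
                           const struct dl_task *new_task)
    {
            double util = (double)new_task->runtime_ns / new_task->period_ns;

            for (int i = 0; i < n; i++)
                    util += (double)tasks[i].runtime_ns / tasks[i].period_ns;
            return util <= 1.0;
    }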
We developed some theoretical results about the schedulability of SCHED_DEADLINE tasks on big.LITTLE platforms, deriving enhanced admission tests that guarantee the ability to host the admitted workload under the assumption of an APEDF scheduler, even in the presence of dynamic task migrations as advocated by APEDF with worst-fit placement, or by the big.LITTLE constant-bandwidth server (BL-CBS).
We implemented ReTiF, a framework for declarative deployment of realtime tasks on multiple cores. This is an open-source daemon and associated client library that can receive requests from clients to "go realtime", providing diverse and heterogeneous information about their timing. For example, some tasks might request specific priorities, others might declare their periods only, and others might also declare their run times. ReTiF partitions the tasks among the available CPUs using different scheduling strategies as needed (e.g. POSIX fixed-priority, rate-monotonic, or SCHED_DEADLINE).
We also tackled the cumbersome problem of minimum-power configuration for embedded platforms with multi-core, DVFS-enabled, multi-capacity-core islands (e.g., Arm big.LITTLE), GPU acceleration, and/or FPGA acceleration, to deploy a set of complex realtime applications. The applications have been modeled as realtime directed acyclic graphs (DAGs) of computations, where some of the functions could be realized either in software running on a CPU, or as a kernel running on the GPU, or as a hardware IP deployed in FPGA slots, if available.
A few experimental results have been shown, where we optimized the deployment of randomly generated realtime DAGs with end-to-end deadlines and optionally FPGA-accelerated functions, minimizing the expected power consumption on a big.LITTLE ODROID-XU4 and a Xilinx UltraScale+ board. We used a mixed-integer quadratic constraint programming (MIQCP) optimization approach, showing how the software tasks have been scheduled with SCHED_DEADLINE using periods automatically computed by the optimizer. We also showed that, by adding the secondary objective of maximizing the minimum relative slack, we could make the schedule more robust with the same power configuration, managing to remove deadline misses occurring in a few runs.
We also reported that, during the experimentation, we needed to swap the order of deadline.c and rt.c among the scheduling classes in the Linux kernel, thus giving POSIX realtime tasks priority over deadline tasks. This was needed because, in order to use FPGA-accelerated functions, we needed an RPC-like interaction with a daemon, which in turn could trigger on-the-fly reprogramming of one of the available FPGA slots, making use of a reconfiguration kernel thread. Both these threads have low processing requirements, yet when they are activated, they need to complete as soon as possible, or the whole DAG gets delayed.
Modeling these interactions in the DAG topology would have been cumbersome, and scheduling these threads with SCHED_DEADLINE didn't seem viable due to the small implied run times and short deadlines (thus high worst-case utilization). Instead, it was more convenient to model them similarly to how we deal with interrupt drivers in this context, i.e. considering a run-time overprovisioning factor that needs to be there anyway to account for external interference with the deployed realtime application DAGs. Interestingly, the possibility of swapping rt.c and deadline.c in the kernel, or even making the order a tunable sysfs option, was discussed for other reasons in other talks throughout OSPM.
The overcommit bandwidth scheduler
Author: Steven Rostedt (video)
Rostedt started out by explaining what the system layout is for ChromeOS. Chromebooks are used for video games, video conferences and many other use cases. There are millions of Chromebooks out in the world. The focus of ChromeOS is the Chrome browser, which is where most applications run. There are thousands of threads that handle all of the services of the Chrome browser. The threads that are the focus for this talk are the renderers and compositors, as well as the threads that handle the user interface (UI). There are other threads of concern that handle things like garbage collection.
ChromeOS uses control groups (cgroups) to contain these threads and manage their priorities. There is a render cgroup containing two child cgroups, foreground and background; there is also a UI cgroup. The render-foreground and UI cgroups are the high-priority ones. However, these cgroups also contain threads that are related to the high-priority threads but are not of high priority themselves.
There's an issue with cgroups: the priorities of one cgroup are not acknowledged by another cgroup. Cgroups are like tasks, in that even a low-priority cgroup will get a share of the CPU. If a low-priority cgroup is running and a high-priority task in another cgroup wakes up, it has to wait for the other cgroup to finish. The Linux kernel's SCHED_OTHER scheduler (the completely fair scheduler, or CFS) is, according to Rostedt, "too fair". Rostedt created a test that runs spinners, each updating a counter, to see how much each one ran. With one thread at nice value -20 (high priority) and one thread at the normal nice value of 0, the high-priority thread ran 87% of the time. When he added 10 threads at nice value 0, the high-priority thread ran for only 40% of the time; with 1,000 threads at the normal nice value, it ran less than 1% of the time.
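A reconstruction of that kind of test (our illustrative version, not Rostedt's actual code; raising priority with a negative nice value requires CAP_SYS_NICE, and the run should be pinned to one CPU, for example with "taskset -c 0", to reproduce single-CPU contention) could look like:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <sys/resource.h>
    #include <unistd.h>

    #define NTHREADS 11     /* one high-priority spinner plus ten at nice 0 */

    static atomic_long counters[NTHREADS];
    static atomic_int stop;

    struct arg { int idx; int nice; };

    static void *spinner(void *p)
    {
            struct arg *a = p;

            /* On Linux, setpriority() with a pid of 0 affects the
             * calling thread only. */
            setpriority(PRIO_PROCESS, 0, a->nice);
            while (!atomic_load(&stop))
                    atomic_fetch_add(&counters[a->idx], 1);
            return NULL;
    }

    int main(void)
    {
            pthread_t tid[NTHREADS];
            struct arg args[NTHREADS];
            long total = 0;

            for (int i = 0; i < NTHREADS; i++) {
                    args[i] = (struct arg){ .idx = i, .nice = i ? 0 : -20 };
                    pthread_create(&tid[i], NULL, spinner, &args[i]);
            }
            sleep(10);                  /* let the spinners compete */
            atomic_store(&stop, 1);
            for (int i = 0; i < NTHREADS; i++) {
                    pthread_join(tid[i], NULL);
                    total += atomic_load(&counters[i]);
            }
            printf("high-priority thread's share: %.1f%%\n",
                   100.0 * atomic_load(&counters[0]) / total);
            return 0;
    }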
Another issue that Rostedt brought up was the migration process, which picks the next task to run rather than the highest-priority task. This means that the load balancer may leave multiple high-priority tasks competing on the same CPU while a low-priority task runs alone on another.
There's a new scheduler that people are talking about, called "Earliest Eligible Virtual Deadline First" (EEVDF), though it is based on a paper written in the 1990s. It adds a "lag" value that tracks how much a task has actually run compared to how much it should have run. If the lag is positive, the task has not run as much as it should have and is eligible to run. If the lag is negative, it is not eligible, because the task has executed on the CPU more than it was supposed to (for example, disabled preemption kept it from leaving the CPU). Rostedt said that this helps a little with latency, but the higher-priority task's time still gets diluted.
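As described, eligibility reduces to the sign of the lag; a minimal sketch (the field names here are ours, not EEVDF's):

    struct sched_entity_sketch {
            long long entitled_ns;   /* service owed under ideal fair sharing */
            long long received_ns;   /* service actually received */
    };

    static int eligible(const struct sched_entity_sketch *se)
    {
            /* Non-negative lag: the task has run less than its share
             * and may be picked; negative lag: it has overrun. */
            return se->entitled_ns - se->received_ns >= 0;
    }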
Rostedt then brought up the realtime scheduler, which does help the situation, but becomes an issue when there's more than one high-priority task to run on the CPU. The FIFO scheduling class will run a task until either a higher-priority task preempts it or it goes to sleep. Round-robin, instead, will rotate among tasks of the same priority, but acts like the FIFO class with respect to tasks of different priorities. Neither of these is good for untrusted tasks, since such tasks can take over the CPU. Videos watched on the Internet are run by the renderers; can we trust running them with a realtime policy? Any changes that are made must still be good for hard realtime and remain POSIX-compliant (no regressions).
Next, Rostedt brought up the MuQSS scheduler (Multiple Queue Skiplist Scheduler) by Con Kolivas, whose earlier work influenced the creation of the CFS scheduler. Kolivas came back with other schedulers over the years; his latest is MuQSS. It uses a trylock operation for access between the different CPU run queues; if it doesn't get the lock, it just doesn't do anything. It also introduces a skip list, a structure similar to a red-black tree that is not as strictly balanced, but is faster for adding and removing items. The MuQSS scheduler is similar to the EEVDF scheduler. The tests comparing the schedulers were run on the 4.14 kernel, as that was the kernel that was easiest to port MuQSS to for ChromeOS; tests against CFS were done on the same 4.14 kernel for comparison.
Rostedt brought up a new proof-of-concept scheduling class called SCHED_HIGH, for which he made up the name but not the concept. The idea is that this scheduling class would sit between realtime and SCHED_OTHER. He based the scheduling class on CFS, but said that CFS was difficult to use for this purpose and that he probably should have used EEVDF. He stated that, if a new scheduler class is created, it can have a new API and be developed without having to worry about regressions, since everything is new. He listed the requirements: tasks can be given different priorities within the class but still get time slices, and the class must not starve SCHED_OTHER tasks.
Next up was Youssef Esmat, who presented the results of the tests. He explained that two Chromebooks were used, one with two cores and one with four. The first test used Google Meet with 16 participants and measured key-press timings: the time between the key-press event and when the graphics show up on the screen. The second test, with 49 participants, measured the time between the mouse moving over a window and an icon being displayed.
The critical tasks had a nice value of -8. The realtime tests set the tasks to priority one. The EEVDF tests used a latency_nice value of -20 (the fastest setting). For SCHED_HIGH, the critical tasks ran in that class. MuQSS used a nice value of -10 instead of -8, since the MuQSS documentation states that this gives a 16ms cycle.
Youssef then showed the results of the tests, starting with the baseline of the current method. During this work, Youssef found that one of the critical tasks was not set to a high priority; fixing that made a significant improvement, dropping the key-press latency on the two-core system by more than half and on the four-core system by around 20%. Moving the critical tasks out of the cgroups improved the numbers further still, dropping mouse latency on the two-core system from 121ms to 71ms. The other tests, including on the four-core system, also improved across the board.
Moving the critical tasks to the realtime class improved things even more. Switching from CFS to EEVDF actually increased latency in every test on the two-core machine, and the four-core numbers went up slightly; dropped frames increased as well with EEVDF. Someone in the audience asked if latency_nice was updated for CFS; Youssef affirmed that it was, but he wasn't able to finish those tests in time for the talk. The SCHED_HIGH prototype performed on par with the realtime class. Overall, the tests showed that MuQSS, SCHED_HIGH, and realtime performed the best, followed by simply moving everything out of the cgroups (MuQSS, SCHED_HIGH, and realtime also did not use cgroups for the critical tasks). MuQSS actually did the best on dropped frames, with the fewest of any configuration.
It was asked if the tests that were used are publicly available. Rostedt and Youssef said that they are in the Chromium repository.
Hierarchical scheduling
Author: Steven Rostedt (video)
The realtime and deadline schedulers are aimed at "hard realtime" applications; this talk, instead, was about "soft realtime" schedulers. The FIFO realtime scheduler, which is priority-driven, running the highest-priority task until it is preempted by a higher-priority task or it sleeps, is not easy to map to applications, as applications may have dynamic priorities depending on when their deadlines are.
The round-robin (RR) scheduler will give time slices to tasks of the same priority, but the length of the slice is not something that can be easily changed; POSIX does not define what that time slice should be.
The deadline scheduler is for periodic realtime tasks. Rate-monotonic scheduling has static priorities and is easier to implement, but it cannot guarantee 100% utilization of the CPU; the classic Liu and Layland result bounds the guaranteed-schedulable utilization for n tasks at n(2^(1/n)-1), which approaches roughly 69% as n grows. The EDF scheduler can reach 100% CPU utilization, but requires calculations to know which task should be scheduled next. That only holds for a single CPU, though; global EDF allows for more than one CPU, but with restrictions.
Basically, "realtime is hard!". Rostedt argued that the Linux kernel is hard realtime, in that it has the facilities to handle hard-realtime tasks. Some argue that Linux is soft realtime, but a system where a missed deadline means failure is, by definition, hard realtime and not soft realtime. The objection some have is that Linux is not mathematically provable; Rostedt said that is a "quality of service" (QoS) concern, not a realtime one. He came to the conclusion that Linux has a "hard realtime design": the OS is designed to be realtime, but is not guaranteed to be so (though it's a bug if it is not).
Rostedt then introduced an "over-commit scheduler", which could be implemented as EEVDF. The system allows tasks to ask for more than 100% of the CPU; when that happens, the tasks will all fail their deadlines, but they will fail "fairly". That is, all will miss their deadlines by the same percentage: if, for example, two tasks each ask for 60% of one CPU (120% in total), each would get 50%, so both would overrun by the same factor.
Rostedt argued for a new scheduling class to give this characteristic of over-committing the CPU. The idea is to keep it separate from the other hard realtime schedulers in order to not "dilute" them, or as Daniel Bristot de Oliveira shouted out, "duct-tape development". This scheduling class should act as hard realtime when it does not use more resources than the CPU can supply, but not fail if it does exceed them. Instead, the QoS would just suffer.
Rostedt stated that there is a need for a scheduling class that gives a general idea about realtime so that applications do not need to know the details of how realtime works. An application should just state that a thread wants a percentage of the CPU, but could possibly have multiple threads that ask for a percentage that may add up to more than 100%. This should be acceptable with the caveat that they will lose their QoS.
He added that there should be an easy way for applications to tell the kernel which tasks are important and which are not; currently, the kernel uses heuristics to determine this. Dhaval Giani asked what the user-space tasks would ask of the kernel. Rostedt mentioned that there's a scheduler that switches from deadline to SCHED_OTHER when a task's declared run time is exhausted. Juri Lelli stated that SCHED_DEADLINE originally did this, and added that everything Rostedt had mentioned so far has been done in the past (but not accepted) and is still doable. Bristot mentioned that problems happen when tasks run in small bursts: a task that runs a little, sleeps, and runs again can take advantage of the run time it builds up to run more than it should later, harming other tasks.
Peter Zijlstra chimed in, stating that SCHED_HIGH tasks need to know about the hard-realtime tasks that interrupt them, otherwise their accounting will go wrong; it's best to use the hard-realtime scheduler as a server for the other class. He went on to say that static priorities are a nightmare and should be removed. Others stated that they are still being used, and Zijlstra said that was due to POSIX, which made a mistake that we have to live with.
Lelli said that SCHED_DEADLINE could be used, but Rostedt said that tasks get throttled when their run time expires; Lelli said it is possible to have them continue to run.
Zijlstra brought up that any bandwidth-oriented scheduler below the realtime class would be broken, because realtime tasks could run indefinitely. Rostedt wondered if rate-monotonic would work, as it does not require the calculations that deadline schedulers do; it just picks the task with the smallest period and lets it run. Rostedt then asked if the issue is that the realtime classes would cause problems for lower classes like CFS, where CFS would not get to run and would need to decide what to do when it does run again, having to make up for the large hole that the hard-realtime schedulers created. Zijlstra said that was part of it, but it also has to do with all of the cgroup code, which needs to account for this as well.
Zijlstra ended by stating that we can get what we want by adding a patch to make the tasks downgrade to SCHED_OTHER when they run out of their time slice. There's also the issue with allowing unprivileged tasks to use SCHED_DEADLINE.
The discussion ended with the idea of possibly trying out SCHED_DEADLINE with sched reclaim enabled to allow tasks to run when their run time is exhausted.
Rewriting RT throttling
Author: Joel Fernandes (video)
Realtime throttling issues have been a concern and a barrier to the usage of realtime scheduling on ChromeOS. One of the issues is that realtime tasks can be bursty; if a realtime task runs for a significant amount of time, it will undergo realtime throttling, and until that happens it will starve CFS tasks. The default settings (a sched_rt_runtime_us of 950000 within a sched_rt_period_us of 1000000) let realtime tasks run for 0.95 seconds out of every second before throttling.
At the conference, there was a notable pushback against the idea of improving the realtime scheduler in ChromeOS to fix throttling issues. While no specific technical reasons were cited, attendees generally suggested that the effect of realtime scheduler throttling should be addressed through alternative means. One recommendation was to revisit an older set of patches by Zijlstra and Lelli that involves running CFS tasks from a SCHED_DEADLINE reservation. The SCHED_DEADLINE server infrastructure would replace the existing realtime throttling mechanism.
Fernandes pointed out that this approach still faces the issue of what happens once throttling kicks in. He suggested that this problem could be resolved either by using a smaller period for the CFS deadline task or by implementing demotion of realtime to CFS, with the former being easier to accomplish. Lelli expressed confidence that, with some tweaks and tuning, the approach could be made to work effectively and offered to collaborate on a call to further discuss the matter. Thomas Gleixner proposed describing the flow of scheduling events during an input event through an API or a similar solution to facilitate better scheduling.
(See this article and this one for recent coverage of realtime throttling).
Saving power by reducing timer interrupts
Author: Joel Fernandes (video)
When profiling the ChromeOS video-playback use case, it was found that high-resolution (HR) timers can cause high power usage due to reduced batching. The conference attendees showed a general interest in these findings, discussing the history and interactions of HR timers with CPU idle states, as well as possible solutions to address the problem. Experts like Gleixner and Daniel Lezcano provided valuable insights into the issue and shared their experience in understanding the intricacies of HR timers.
Gleixner suggested a potential reason why dynamically toggling HR timers based on use cases might not be working: the need to stop the tick when transitioning from high resolution to low resolution. He recommended collecting more traces to better understand the issue. Although Gleixner was open to the idea of turning off HR timers dynamically, he also emphasized the importance of addressing the problem in user space. He shared a past example of the Alarm user-space applet, which experienced a bug when HR timers were introduced, as it would wake up too frequently.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: Parallel Programming update; Debian 12 "bookworm"; PostgreSQL documentation; 2022 Tracing Summit; Quote; ...
- Announcements: Newsletters, conferences, security updates, patches, and more.