Addressing priority inversion with proxy execution
To understand priority inversion, imagine that a low-priority, background task acquires a mutex. If a realtime task happens to need that same mutex, it will find itself blocked, waiting for the low-priority task to let go of it. Should yet another task, with medium priority, come along, it may prevent the low-priority task from executing at all, meaning that the mutex will not be released and the realtime task will be blocked indefinitely. That is exactly the sort of outcome that the priority mechanism is intended to prevent.
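This scenario is easy to reproduce from user space. Below is a minimal sketch, not taken from the patch set, with thread names and priority values invented for illustration; if it is pinned to a single CPU (with `taskset -c 0`, for example) and run with the privileges needed to set realtime priorities, the high-priority thread will hang behind the spinning medium-priority one:

```c
/* Priority inversion in miniature: run pinned to one CPU, as root. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void set_fifo_priority(int prio)
{
	struct sched_param sp = { .sched_priority = prio };

	/* Requires CAP_SYS_NICE; errors are ignored for brevity. */
	pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}

static void *low_task(void *arg)
{
	set_fifo_priority(1);
	pthread_mutex_lock(&lock);	/* grab the shared resource */
	sleep(1);			/* hold it for a while */
	pthread_mutex_unlock(&lock);
	return NULL;
}

static void *medium_task(void *arg)
{
	set_fifo_priority(50);
	for (;;)			/* spin, starving low_task */
		;
	return NULL;
}

static void *high_task(void *arg)
{
	set_fifo_priority(99);
	pthread_mutex_lock(&lock);	/* blocks behind low_task */
	puts("high-priority task ran");
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t low, med, high;

	pthread_create(&low, NULL, low_task, NULL);
	usleep(10000);		/* let low_task take the mutex first */
	pthread_create(&high, NULL, high_task, NULL);
	pthread_create(&med, NULL, medium_task, NULL);
	pthread_join(high, NULL);	/* never returns once inverted */
	return 0;
}
```

Once low_task's sleep() ends, that thread is runnable but can never preempt the spinning medium-priority thread, so the mutex is never released and high_task waits indefinitely.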
A classic solution to priority inversion is priority inheritance. If a high-priority task finds itself blocked on a resource held by another, it lends its priority to the owning task, allowing that task to complete its work and release the resource. The Linux kernel has supported priority inheritance for a long time, but that is not a complete solution to the problem. Deadline scheduling complicates the situation, in that it is not priority based. Since a task running in the deadline class has no priority, it cannot lend that priority to another task. So priority inheritance will not work with tasks using deadline scheduling.
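From user space, priority inheritance can be requested with the standard POSIX mutex-protocol attribute, which is backed by the kernel's PI-futex support. This sketch uses only ordinary POSIX API, nothing from the patch set; substituting such a mutex into the example above would let the low-priority thread be boosted past the spinner and unlock promptly:

```c
/* Create a priority-inheritance mutex via the POSIX protocol attribute. */
#include <pthread.h>

int make_pi_mutex(pthread_mutex_t *m)
{
	pthread_mutexattr_t attr;
	int ret;

	ret = pthread_mutexattr_init(&attr);
	if (ret)
		return ret;
	/*
	 * With PTHREAD_PRIO_INHERIT, a low-priority holder of this mutex
	 * runs at the priority of its highest-priority waiter until it
	 * unlocks.
	 */
	ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	if (!ret)
		ret = pthread_mutex_init(m, &attr);
	pthread_mutexattr_destroy(&attr);
	return ret;
}
```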
Kernel developers have been working on this problem for some time; it was discussed at the 2019 and 2020 scheduling and power management (OSPM) conferences, for example. The latest patch set, posted by John Stultz but containing the work of a number of developers, shows the current state of this work. At its core, "proxy execution" involves letting a blocked task lend its entire scheduling context to another task holding a needed resource.
To be able to implement proxy execution, the scheduler needs to know exactly which resource a blocked task is waiting for. The task_struct structure already contains a struct mutex pointer called blocked_on that serves exactly this purpose but, in current kernels, it is only compiled in if mutex debugging is enabled. The patch series makes this field unconditional so that this tracking is always performed. The mutex structure already has a pointer to the task that owns it at any given time; the patch series makes that pointer available to the scheduler. The combination of these two pointers allows the scheduler to locate the task holding the resource needed by another task.
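In outline, the relationship looks something like the following sketch; these are simplified stand-ins for the real definitions (in the actual kernel, for example, the mutex owner is stored in an atomic word along with some flag bits):

```c
#include <stddef.h>

struct mutex;

struct task_struct {
	/* ... many fields omitted ... */
	struct mutex *blocked_on;	/* mutex being waited on, or NULL */
};

struct mutex {
	/* Simplified: the kernel packs this into an atomic owner word. */
	struct task_struct *owner;
};

/* Find the task holding the resource that blocks @p, if any. */
static struct task_struct *blocked_on_owner(struct task_struct *p)
{
	return p->blocked_on ? p->blocked_on->owner : NULL;
}
```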
The task_struct structure contains a vast amount of information about a running task. The patch series recognizes that this information serves two different roles relevant to scheduling: the execution context and the scheduling context. The execution context contains the information needed to run a given task, while the scheduling context describes how the task will be treated by the CPU scheduler. To enable a logical separation of these two roles, the rq (run queue) structure gains a second task_struct pointer for the scheduling context. Most of the time, the execution and scheduling contexts for a given run-queue entry will be the same, but proxy execution may cause them to differ.
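A sketch of that change, with an illustrative name for the new field rather than the patch's exact one:

```c
struct rq {
	/* ... */
	struct task_struct *curr;	/* execution context: the task whose
					   code is actually running */
	struct task_struct *curr_sched;	/* scheduling context: the task whose
					   policy and priority (or deadline)
					   the scheduler consults; name is
					   illustrative */
	/* Normally curr == curr_sched; proxy execution splits them. */
};
```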
The scheduler's run queues hold tasks that are in a runnable state — they would be on a CPU if one were available for them. When a task blocks to wait for a resource, it is removed from the run queue until it becomes runnable again. One of the more interesting changes made by this patch set is to cause blocked tasks to remain on the run queue, even though they are not, in fact, runnable. That causes the scheduler to pick the first task that it would run, assuming its resources were available, rather than the first task that it can run.
This mechanism may thus leave the scheduler trying to run a task that can't actually run; this is the time for the scheduler to give the CPU to the task holding the resource blocking the execution of the task that the scheduler really wants to run. With the infrastructure described above, implementing this proxy execution is conceptually simple. If the chosen task is not runnable, then follow its blocked_on pointer to find the task it's waiting for, give that task the blocked task's scheduling context (thus boosting its position in the run queue), and run it instead. When the boosted task releases the mutex it is holding, it will lose the other task's scheduling context, and the higher-priority task will be able to continue. Problem solved.
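Stripped of locking, migration handling, and the many edge cases the real patches must deal with, the selection step might look like this sketch, built on the simplified structures shown earlier:

```c
/*
 * Resolve the scheduler's chosen task to one that can actually run.
 * The returned task executes, but with @selected's scheduling context.
 */
static struct task_struct *proxy_resolve(struct task_struct *selected)
{
	struct task_struct *owner = selected;

	/*
	 * Follow the chain of blocked-on relationships to a runnable
	 * task; a real implementation must also guard against deadlock
	 * cycles in this chain.
	 */
	while (owner->blocked_on)
		owner = owner->blocked_on->owner;

	return owner;
}
```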
Naturally, there are a few complications. The task holding the needed mutex may, itself, be blocked on yet another resource, so the scheduler will need to be able to follow a chain of blocked-on relationships. A scheduling context may include a constraint on which CPUs may be used, so a task running as a proxy may need to be migrated to a different CPU first. The scheduler has to keep proxy execution in mind before deciding to migrate a task to another CPU as part of its normal load balancing. CPU-time accounting also becomes more complex; the time used by a task while running as a proxy for another should be charged to the running task, but it is taken from the higher-priority task's time slice to maintain scheduling fairness.
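The accounting rule described above might be sketched as follows; the structure and field names are invented for illustration and do not match the kernel's actual accounting code:

```c
/* Invented, minimal accounting fields for illustration only. */
struct task_struct {
	unsigned long long cpu_time_ns;	/* CPU time charged to this task */
	unsigned long long slice_ns;	/* remaining scheduling allowance */
};

/* Charge one tick of proxy execution, per the split described above. */
static void account_proxy_tick(struct task_struct *exec_ctx,
			       struct task_struct *sched_ctx,
			       unsigned long long delta_ns)
{
	exec_ctx->cpu_time_ns += delta_ns;	/* used by the proxy */
	sched_ctx->slice_ns -= delta_ns;	/* paid from the donor */
}
```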
The kernel normally tries hard to spread realtime and deadline tasks across the system's CPUs so that all of them can run, but proxy execution binds the tasks involved onto the same CPU. If one of them is to be migrated to achieve the needed separation, both must be, and here, too, there may be a chain of blocked tasks to worry about.
One of the most complex patches in the series attempts to solve this problem. Rather than create "some sort of complex data structure" to track the ability to move tasks, it changes the load-balancing code to simply search through the list of potentially movable tasks. The idea here is that, once the behavior is seen to be correct, optimizations can be applied.
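The kind of per-task check such a search might perform could look like this sketch, using the simplified structures from earlier and a hypothetical cpu_allowed() helper standing in for the kernel's affinity test:

```c
#include <stdbool.h>

struct task_struct;
struct mutex { struct task_struct *owner; };
struct task_struct { struct mutex *blocked_on; };

/* Hypothetical stand-in for checking a task's CPU-affinity mask. */
bool cpu_allowed(struct task_struct *p, int cpu);

/*
 * Proxy execution keeps a blocked chain together, so migrating any
 * member means migrating all of them: every task in the chain must be
 * allowed to run on the destination CPU.
 */
static bool chain_can_migrate(struct task_struct *p, int dest_cpu)
{
	for (; p; p = p->blocked_on ? p->blocked_on->owner : NULL)
		if (!cpu_allowed(p, dest_cpu))
			return false;
	return true;
}
```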
The patch series has not received any review comments as of this writing; all reviewers, it seems, are blocked on other tasks. Given the complexity and long history of this work, though, it seems unlikely that this version will be the last one. Even seemingly simple changes can be hard to apply to the CPU scheduler without creating subtle problems, and this change is not simple.
Posted Jun 11, 2023 17:06 UTC (Sun) by bredelings (subscriber, #53082)
At a higher level, there seems to be a lot of faith that giving more CPU time to a task holding a mutex leads to some kind of forward progress towards dropping the mutex, and that tasks hold mutexes for the minimum possible code section. Maybe the whole mutex approach already assumes that kind of trust though.
Not sure I'm understanding the implications here. I'm sure people have thought about it already -- would be interested to hear more.
Posted Jun 11, 2023 17:16 UTC (Sun) by willy (subscriber, #9762)
Just to be clear, you can't return to userspace with a kernel mutex held. This differs from, eg, a counting/binary semaphore or the page lock, both of which can be released from interrupt context.
