Addressing priority inversion with proxy execution
To understand priority inversion, imagine that a low-priority, background task acquires a mutex. If a realtime task happens to need that same mutex, it will find itself blocked, waiting for the low-priority task to let go of it. Should yet another task, with medium priority, come along, it may prevent the low-priority task from executing at all, meaning that the mutex will not be released and the realtime task will be blocked indefinitely. That is exactly the sort of outcome that the priority mechanism is intended to prevent.
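This scenario is easy to reproduce from user space. Below is a minimal sketch, not taken from the patch set, with thread names and priority values invented for illustration; if it is pinned to a single CPU (with `taskset -c 0`, for example) and run with the privileges needed to set realtime priorities, the high-priority thread will hang behind the spinning medium-priority one:

```c
/* Priority inversion in miniature: run pinned to one CPU, as root. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void set_fifo_priority(int prio)
{
	struct sched_param sp = { .sched_priority = prio };

	/* Requires CAP_SYS_NICE; errors are ignored for brevity. */
	pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}

static void *low_task(void *arg)
{
	set_fifo_priority(1);
	pthread_mutex_lock(&lock);	/* grab the shared resource */
	sleep(1);			/* hold it for a while */
	pthread_mutex_unlock(&lock);
	return NULL;
}

static void *medium_task(void *arg)
{
	set_fifo_priority(50);
	for (;;)			/* spin, starving low_task */
		;
	return NULL;
}

static void *high_task(void *arg)
{
	set_fifo_priority(99);
	pthread_mutex_lock(&lock);	/* blocks behind low_task */
	puts("high-priority task ran");
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t low, med, high;

	pthread_create(&low, NULL, low_task, NULL);
	usleep(10000);		/* let low_task take the mutex first */
	pthread_create(&high, NULL, high_task, NULL);
	pthread_create(&med, NULL, medium_task, NULL);
	pthread_join(high, NULL);	/* never returns once inverted */
	return 0;
}
```

Once low_task's sleep() ends, that thread is runnable but can never preempt the spinning medium-priority thread, so the mutex is never released and high_task waits indefinitely.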
A classic solution to priority inversion is priority inheritance. If a high-priority task finds itself blocked on a resource held by another, it lends its priority to the owning task, allowing that task to complete its work and release the resource. The Linux kernel has supported priority inheritance for a long time, but that is not a complete solution to the problem. Deadline scheduling complicates the situation, in that it is not priority based. Since a task running in the deadline class has no priority, it cannot lend that priority to another task. So priority inheritance will not work with tasks using deadline scheduling.
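From user space, priority inheritance can be requested with the standard POSIX mutex-protocol attribute, which is backed by the kernel's PI-futex support. This sketch uses only ordinary POSIX API, nothing from the patch set; substituting such a mutex into the example above would let the low-priority thread be boosted past the spinner and unlock promptly:

```c
/* Create a priority-inheritance mutex via the POSIX protocol attribute. */
#include <pthread.h>

int make_pi_mutex(pthread_mutex_t *m)
{
	pthread_mutexattr_t attr;
	int ret;

	ret = pthread_mutexattr_init(&attr);
	if (ret)
		return ret;
	/*
	 * With PTHREAD_PRIO_INHERIT, a low-priority holder of this mutex
	 * runs at the priority of its highest-priority waiter until it
	 * unlocks.
	 */
	ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	if (!ret)
		ret = pthread_mutex_init(m, &attr);
	pthread_mutexattr_destroy(&attr);
	return ret;
}
```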
Kernel developers have been working on this problem for some time; it was discussed at the 2019 and 2020 scheduling and power management (OSPM) conferences, for example. The latest patch set, posted by John Stultz but containing the work of a number of developers, shows the current state of this work. At its core, "proxy execution" involves letting a blocked task lend its entire scheduling context to another task holding a needed resource.
To be able to implement proxy execution, the scheduler needs to know exactly which resource a blocked task is waiting for. The task_struct structure already contains a struct mutex pointer called blocked_on that serves exactly this purpose but, in current kernels, it is only compiled in if mutex debugging is enabled. The patch series makes this field unconditional so that this tracking is always performed. The mutex structure already has a pointer to the task that owns it at any given time; the patch series makes that pointer available to the scheduler. The combination of these two pointers allows the scheduler to locate the task holding the resource needed by another task.
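In outline, the relationship looks something like the following sketch; these are simplified stand-ins for the real definitions (in the actual kernel, for example, the mutex owner is stored in an atomic word along with some flag bits):

```c
#include <stddef.h>

struct mutex;

struct task_struct {
	/* ... many fields omitted ... */
	struct mutex *blocked_on;	/* mutex being waited on, or NULL */
};

struct mutex {
	/* Simplified: the kernel packs this into an atomic owner word. */
	struct task_struct *owner;
};

/* Find the task holding the resource that blocks @p, if any. */
static struct task_struct *blocked_on_owner(struct task_struct *p)
{
	return p->blocked_on ? p->blocked_on->owner : NULL;
}
```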
The task_struct structure contains a vast amount of information about a running task. The patch series recognizes that this information serves two different roles relevant to scheduling: the execution context and the scheduling context. The execution context contains the information needed to run a given task, while the scheduling context describes how the task will be treated by the CPU scheduler. To enable a logical separation of these two roles, the rq (run queue) structure gains a second task_struct pointer for the scheduling context. Most of the time, the execution and scheduling contexts for a given run-queue entry will be the same, but proxy execution may cause them to differ.
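A sketch of that change, with an illustrative name for the new field rather than the patch's exact one:

```c
struct rq {
	/* ... */
	struct task_struct *curr;	/* execution context: the task whose
					   code is actually running */
	struct task_struct *curr_sched;	/* scheduling context: the task whose
					   policy and priority (or deadline)
					   the scheduler consults; name is
					   illustrative */
	/* Normally curr == curr_sched; proxy execution splits them. */
};
```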
The scheduler's run queues hold tasks that are in a runnable state — they would be on a CPU if one were available for them. When a task blocks to wait for a resource, it is removed from the run queue until it becomes runnable again. One of the more interesting changes made by this patch set is to cause blocked tasks to remain on the run queue, even though they are not, in fact, runnable. That causes the scheduler to pick the first task that it would run, assuming its resources were available, rather than the first task that it can run.
This mechanism may thus leave the scheduler trying to run a task that can't actually run; this is the time for the scheduler to give the CPU to the task holding the resource blocking the execution of the task that the scheduler really wants to run. With the infrastructure described above, implementing this proxy execution is conceptually simple. If the chosen task is not runnable, then follow its blocked_on pointer to find the task it's waiting for, give that task the blocked task's scheduling context (thus boosting its position in the run queue), and run it instead. When the boosted task releases the mutex it is holding, it will lose the other task's scheduling context, and the higher-priority task will be able to continue. Problem solved.
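Stripped of locking, migration handling, and the many edge cases the real patches must deal with, the selection step might look like this sketch, built on the simplified structures shown earlier:

```c
/*
 * Resolve the scheduler's chosen task to one that can actually run.
 * The returned task executes, but with @selected's scheduling context.
 */
static struct task_struct *proxy_resolve(struct task_struct *selected)
{
	struct task_struct *owner = selected;

	/*
	 * Follow the chain of blocked-on relationships to a runnable
	 * task; a real implementation must also guard against deadlock
	 * cycles in this chain.
	 */
	while (owner->blocked_on)
		owner = owner->blocked_on->owner;

	return owner;
}
```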
Naturally, there are a few complications. The task holding the needed mutex may, itself, be blocked on yet another resource, so the scheduler will need to be able to follow a chain of blocked-on relationships. A scheduling context may include a constraint on which CPUs may be used, so a task running as a proxy may need to be migrated to a different CPU first. The scheduler has to keep proxy execution in mind before deciding to migrate a task to another CPU as part of its normal load balancing. CPU-time accounting also becomes more complex; the time used by a task while running as a proxy for another should be charged to the running task, but it is taken from the higher-priority task's time slice to maintain scheduling fairness.
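The accounting rule described above might be sketched as follows; the structure and field names are invented for illustration and do not match the kernel's actual accounting code:

```c
/* Invented, minimal accounting fields for illustration only. */
struct task_struct {
	unsigned long long cpu_time_ns;	/* CPU time charged to this task */
	unsigned long long slice_ns;	/* remaining scheduling allowance */
};

/* Charge one tick of proxy execution, per the split described above. */
static void account_proxy_tick(struct task_struct *exec_ctx,
			       struct task_struct *sched_ctx,
			       unsigned long long delta_ns)
{
	exec_ctx->cpu_time_ns += delta_ns;	/* used by the proxy */
	sched_ctx->slice_ns -= delta_ns;	/* paid from the donor */
}
```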
The kernel normally tries hard to spread realtime and deadline tasks across the system's CPUs so that all of them can run, but proxy execution binds the tasks involved onto the same CPU. If one of them is to be migrated to achieve the needed separation, both must be, and here, too, there may be a chain of blocked tasks to worry about.
One of the most complex patches in the series attempts to solve this problem. Rather than create "some sort of complex data structure" to track the ability to move tasks, it changes the load-balancing code to simply search through the list of potentially movable tasks. The idea here is that, once the behavior is seen to be correct, optimizations can be applied.
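The kind of per-task check such a search might perform could look like this sketch, using the simplified structures from earlier and a hypothetical cpu_allowed() helper standing in for the kernel's affinity test:

```c
#include <stdbool.h>

struct task_struct;
struct mutex { struct task_struct *owner; };
struct task_struct { struct mutex *blocked_on; };

/* Hypothetical stand-in for checking a task's CPU-affinity mask. */
bool cpu_allowed(struct task_struct *p, int cpu);

/*
 * Proxy execution keeps a blocked chain together, so migrating any
 * member means migrating all of them: every task in the chain must be
 * allowed to run on the destination CPU.
 */
static bool chain_can_migrate(struct task_struct *p, int dest_cpu)
{
	for (; p; p = p->blocked_on ? p->blocked_on->owner : NULL)
		if (!cpu_allowed(p, dest_cpu))
			return false;
	return true;
}
```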
The patch series has not received any review comments as of this writing; all reviewers, it seems, are blocked on other tasks. Given the complexity and long history of this work, though, it seems unlikely that this version will be the last one. Even seemingly simple changes can be hard to apply to the CPU scheduler without creating subtle problems, and this change is not simple.
Posted Jun 11, 2023 17:06 UTC (Sun) by bredelings (subscriber, #53082)
At a higher level, there seems to be a lot of faith that giving more CPU time to a task holding a mutex leads to some kind of forward progress towards dropping the mutex, and that tasks hold mutexes for the minimum possible code section. Maybe the whole mutex approach already assumes that kind of trust though.
Not sure I'm understanding the implications here. I'm sure people have thought about it already -- would be interested to hear more.
Posted Jun 11, 2023 17:16 UTC (Sun) by willy (subscriber, #9762)
Just to be clear, you can't return to userspace with a kernel mutex held. This differs from, eg, a counting/binary semaphore or the page lock, both of which can be released from interrupt context.
