A proxy-execution baby step
The classic solution for priority inversion is priority inheritance: if a high-priority task finds itself blocked on a lock, it lends its priority to the lock holder, allowing the holder to progress and release the lock. Linux implements priority inheritance for the realtime scheduling classes, but that approach is not really applicable to the normal scheduling classes (where priorities are far more dynamic) or the deadline class (which has no priorities at all). So a different tack is called for.
That tack is proxy execution. While priority inheritance donates only the waiting task's priority to the lock holder, proxy execution also donates the waiting task's available CPU time. In short, if a high-priority ("donor") task finds itself waiting on a lock, the lock holder (the "proxy") is allowed to run in its place, using the donor's time on the CPU to get its work done. It is a relatively simple idea, but the implementation is anything but.
The next step
This patch series from John Stultz (containing the work of several developers) pushes the proxy-execution project one significant step forward. It starts by adding a new kernel configuration option, SCHED_PROXY_EXEC, to control whether the feature is built into the kernel. At this point, proxy execution is incompatible with realtime preemption, and with the extensible scheduler class as well, so the kernel cannot (yet) be built with all of those features enabled.
The kernel's massive task_struct structure, used to represent a task in the system, optionally contains a field called blocked_on. This field, which is only present if mutex debugging is enabled, tells the kernel which mutex (if any) a task is currently waiting for. The proxy-execution series makes this field unconditional, so that it is always available for the kernel to refer to. It provides a crucial link that lets the kernel determine which task needs to be allowed to run so that it can release the mutex in question.
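As a rough sketch of that link (simplified structures, not the kernel's actual definitions, which involve far more fields and careful locking), the relationship looks something like this:

    /* Simplified model of the blocked_on link; the real task_struct and
       struct mutex are much larger, and access to these fields requires
       the appropriate locks. */
    #include <stddef.h>

    struct mutex;

    struct task_struct {
        struct mutex *blocked_on;   /* the mutex this task waits for, or NULL */
        /* ... hundreds of other fields ... */
    };

    struct mutex {
        struct task_struct *owner;  /* the task currently holding the mutex */
        /* ... wait list, etc. ... */
    };

    /* Follow the link from a blocked task to the task that is able to
       release the mutex it is waiting for. */
    static struct task_struct *lock_holder(struct task_struct *waiter)
    {
        return waiter->blocked_on ? waiter->blocked_on->owner : NULL;
    }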
One of the key changes in the overall proxy-execution project is the separation of "scheduling context" from "execution context". The scheduling context is essentially a task's position in the scheduler's run queue, while the execution context describes the task that actually runs when the scheduling context is selected for execution. In current kernels, the two contexts are always the same, but proxy execution will change that situation. One task's scheduling context may be chosen to run, but the holder of the lock that task is waiting for is the one that will get the CPU.
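Pictured as data structures (with purely illustrative names; the actual run-queue changes are more involved), the split means a run queue ends up answering two questions instead of one:

    /* Illustrative sketch only: the scheduling context ("donor") that was
       picked from the run queue need not be the task whose code actually
       runs on the CPU (the execution context). */
    struct task_struct;

    struct rq {
        struct task_struct *donor;  /* scheduling context: whose slot was picked */
        struct task_struct *curr;   /* execution context: the task actually running */
    };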
That leads to a bit of an accounting problem, though. The CPU time used by the execution context will be charged against the scheduling context: the proxy will burn a bit of the donor's time slice so that it can get its work done. But the total CPU time usage of the execution context should be increased to reflect the time it spends running as a proxy. That is the time value that is visible to user space; having it reflect the actual execution time of the task makes it clear that the task is, indeed, executing.
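A hedged sketch of that split in the accounting (the field and function names here are assumptions for illustration, not the kernel's):

    #include <stdint.h>

    struct task_struct {
        uint64_t slice_remaining_ns;    /* scheduler bandwidth left to this task */
        uint64_t sum_exec_runtime_ns;   /* total execution time seen by user space */
    };

    /* Charge one tick's worth of CPU time during proxy execution: the
       donor pays with its time slice, while the proxy is credited with
       having actually run. */
    static void account_proxy_time(struct task_struct *donor,
                                   struct task_struct *proxy,
                                   uint64_t delta_ns)
    {
        donor->slice_remaining_ns -=
            delta_ns < donor->slice_remaining_ns ? delta_ns
                                                 : donor->slice_remaining_ns;
        proxy->sum_exec_runtime_ns += delta_ns;
    }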
Normally, when a task finds itself blocked on a mutex, that task is deactivated (removed from the run queue) and not further considered by the scheduler until the mutex is released. Another important change made to support proxy execution is that a task that is blocked in this way is, instead, left on the run queue, but marked as being blocked. If the scheduler picks that task for execution, it will see the special flag, follow the blocked_on pointer to the lock in question, and from there it can find the owner of that lock. The lock owner can then be run in the blocked task's place.
That, at least, is the idea, but there are complications. For example, the lock-holding task may, itself, be blocked on a different lock. So the scheduler cannot just follow one pointer, it must be prepared to follow a chain of them until it finds something that can actually be run to push the whole chain forward. The current series includes an implementation of that logic. For extra fun, the situation could change while the scheduler is following that chain, so it must check at the end and, if it appears that the state of the relevant tasks has changed, bail out and restart from the beginning.
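A rough sketch of that chain walk, including the end-of-walk check and restart (names like find_proxy_task() and the structures here are simplified assumptions; the real code must hold the right locks and handle many more corner cases):

    #include <stdbool.h>

    struct mutex;

    struct task_struct {
        bool blocked_flag;          /* on the run queue, but blocked on a mutex */
        struct mutex *blocked_on;
    };

    struct mutex {
        struct task_struct *owner;
    };

    /* Placeholder: in the real code, the scheduler must verify that the
       locks and owners it saw while walking are still in place. */
    static bool chain_changed(struct task_struct *selected,
                              struct task_struct *owner)
    {
        (void)selected;
        (void)owner;
        return false;
    }

    /* Starting from the task the scheduler picked, follow blocked_on
       links until something runnable is found; if the chain changed
       underneath us, start over from the beginning. */
    static struct task_struct *find_proxy_task(struct task_struct *selected)
    {
        struct task_struct *owner;

    retry:
        owner = selected;
        while (owner->blocked_flag && owner->blocked_on)
            owner = owner->blocked_on->owner;

        if (chain_changed(selected, owner))
            goto retry;

        return owner;   /* run this task in the donor's place */
    }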
The EEVDF scheduler can, in some circumstances (described in this article), leave a task on the run queue even though that task has used its time slice and is not actually eligible to run. In the current patch series, if the lock holder turns out to be in this "deferred dequeue" state, the scheduler just gives up, deactivates the blocked task, and tries again. Dealing with this special case is just one of many details that have been left for future work.
The biggest limitation of the proxy-execution work, as seen in this patch series, comes about if the lock holder is running on a different CPU than the blocked task. There is a whole set of complications involved in this case, so the code does not even try. Unless the two tasks are running on the same CPU, the blocked task will, once again, be deactivated to wait in the old-fashioned way.
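In this series, that check is about as simple as it sounds; something like the following (illustrative only):

    #include <stdbool.h>

    struct task_struct {
        int cpu;    /* the CPU whose run queue the task is on */
    };

    /* Only proxy when the donor and the lock holder share a run queue;
       otherwise fall back to deactivating the blocked task as before. */
    static bool can_proxy(const struct task_struct *donor,
                          const struct task_struct *owner)
    {
        return donor->cpu == owner->cpu;
    }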
On a modern-day system — even on a small, mobile system — there are enough CPUs that the chances of both tasks running on the same one will be relatively small most of the time. That, in turn, means that proxy execution, in its state at the end of this patch series, is not yet a particularly useful feature for users. It is, however, useful for developers who are trying to understand this work; limiting proxy execution to same-CPU tasks makes the series much easier to review.
That review has been done, and this series is now staged to go upstream during the 6.17 merge window. Even if it is not a complete solution, it is a significant step toward that solution.
Donor migration
The next step can be seen in this patch series adding "donor migration". In simple terms, it handles the different-CPU case by migrating the donor task to the CPU where the lock holder is running. At that point, its scheduling context will be on the correct run queue to allow the proxying to happen.
Of course, nothing in the scheduler is quite that simple. System administrators can set CPU masks on groups of tasks that limit them to a subset of the CPUs on the system. So it may well be that the donor task cannot actually run on the CPU where the proxy is. Lending its scheduling context on that CPU is fine, but the scheduler has to take care to migrate the donor task back to a CPU it is allowed to run on once that task becomes runnable. It also would not do for the scheduler's load-balancing code to migrate either task somewhere else while the proxy execution is happening, so more care is required to disable migration in that case. Once again, there may be a whole chain of tasks involved; migrating all of them at the outset is more efficient than doing the job piecemeal.
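One way to picture the bookkeeping involved (the helpers and fields below are assumptions for illustration, not the kernel's API):

    #include <stdbool.h>

    struct cpumask { unsigned long bits; };

    struct task_struct {
        int cpu;                        /* CPU whose run queue the task is on */
        int return_cpu;                 /* where to send it back when it wakes */
        struct cpumask cpus_allowed;    /* administrator-imposed affinity mask */
        bool migration_disabled;        /* keep the load balancer away */
    };

    static bool cpu_allowed(const struct cpumask *mask, int cpu)
    {
        return (mask->bits >> cpu) & 1UL;
    }

    /* Park the blocked donor's scheduling context on the lock holder's
       CPU, even if the donor itself may not execute there, and keep the
       load balancer from moving it while the proxying is under way. */
    static void proxy_migrate_donor(struct task_struct *donor, int owner_cpu)
    {
        donor->return_cpu = donor->cpu;
        donor->cpu = owner_cpu;
        donor->migration_disabled = true;
    }

    /* When the donor becomes runnable again, make sure it ends up on a
       CPU it is actually allowed to run on. */
    static void proxy_donor_runnable(struct task_struct *donor)
    {
        donor->migration_disabled = false;
        if (!cpu_allowed(&donor->cpus_allowed, donor->cpu))
            donor->cpu = donor->return_cpu;
    }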
Finally, there is the question of what happens when the mutex that caused all of this work is finally released. Having some unrelated task swoop in and grab it before the task that donated its CPU time gets a chance to do so seems unfair at best. It could also happen reasonably often, especially in situations where the donor task has to be migrated back to its original CPU before it can run. To avoid this problem, the mutex code is enhanced to recognize that a lock has been released as the result of proxy execution, and to hand the lock directly to the donor task in that case.
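A sketch of that hand-off (structure and names are illustrative; the real mutex code works with flag bits in the owner field and a proper wait list):

    #include <stdbool.h>
    #include <stddef.h>

    struct task_struct;

    struct mutex {
        struct task_struct *owner;
        bool release_via_proxy;     /* the owner ran as a proxy for a donor */
    };

    /* On unlock, if the critical section was completed on a donor's CPU
       time, hand the mutex directly to that donor instead of letting an
       unrelated task grab it first. */
    static void proxy_mutex_unlock(struct mutex *lock, struct task_struct *donor)
    {
        if (lock->release_via_proxy && donor != NULL) {
            lock->owner = donor;            /* direct hand-off to the donor */
            lock->release_via_proxy = false;
        } else {
            lock->owner = NULL;             /* normal release: up for grabs */
        }
    }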
Getting this work into shape for merging could take a little while yet; the
current posting is the 20th version, and more are likely to come. Once it
is in, the problem still will not be completely solved, though the end will
be coming into sight. It looks like proxy execution will, eventually, as
the result of persistent effort by a number of developers, be a part of the
mainline kernel.