The 7.0 scheduler regression that wasn't
One of the key decisions a CPU scheduler must make is when to remove a running process from the CPU to allow another to run. Preempting processes quickly when there is higher-priority work to do can produce quicker response times and, thus, lower latency. Aggressive preemption comes with a cost, though, in terms of the overall throughput of the system. Rapid switching of tasks can lead to more scheduler overhead, worse cache utilization, and more lock contention. It is hard to find a solution that works for every workload, a fact that has made it hard to remove the variety of preemption modes from the scheduler.
The lazy-preemption mode was designed with an eye toward the needs of both latency-sensitive and throughput-driven workloads. Unlike the full-preemption or realtime modes, lazy preemption will normally allow a task to run for a while even after the need for preemption has been detected. Preemption is instead deferred until the task exhausts its time slice, the task blocks for some other reason, or the next scheduler tick occurs. That leads to a quicker preemption than would happen with the PREEMPT_NONE mode (which only preempts a process at the end of its time slice), but still allows the task to run for a while before the preemption occurs.
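The difference between the modes described above can be made concrete with a toy timing model. This is only a sketch of the behavior as this article describes it, not of the kernel's implementation; the function name, its parameters, and the example numbers are all hypothetical.

```python
from typing import Optional


def preemption_time_us(mode: str,
                       need_resched_us: int,
                       slice_end_us: int,
                       tick_period_us: int,
                       blocks_at_us: Optional[int] = None) -> int:
    """Toy model: when does the running task lose the CPU?

    All times are microseconds since the task started running.
    'mode' is one of "full", "lazy", or "none" (labels chosen for this sketch).
    """
    if mode not in ("full", "lazy", "none"):
        raise ValueError(f"unknown mode: {mode!r}")

    # In every mode the task leaves the CPU when its slice ends or it blocks.
    candidates = [slice_end_us]
    if blocks_at_us is not None:
        candidates.append(blocks_at_us)

    if mode == "full":
        # Preempt as soon as the need for preemption is detected.
        candidates.append(need_resched_us)
    elif mode == "lazy":
        # Defer to the first scheduler tick at or after detection.
        ticks = (need_resched_us + tick_period_us - 1) // tick_period_us
        candidates.append(ticks * tick_period_us)
    # mode == "none": nothing to add; only slice end or blocking applies.

    return min(candidates)
```

With an assumed 1000µs tick, a slice ending at 6000µs, and the need for preemption detected at 2500µs, this model gives 2500µs for "full", 3000µs for "lazy", and 6000µs for "none".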
Dipietro reported that the PostgreSQL performance regression was caused by a large increase in lock contention. PostgreSQL uses user-space spinlocks for much of its concurrency control; one problem with such locks is that, if a lock holder is preempted before the lock can be released, other processes will spin on a lock that may remain held for a long time. An increase in the frequency of preemption could indeed cause this to happen; more preemptions mean more chances to sideline a process before it is able to release a contended lock.
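For readers unfamiliar with the pattern, a user-space spinlock can be sketched in a few lines. PostgreSQL's real spinlocks are written in C on top of atomic instructions; the Python below is only an illustration of the busy-wait idea, with `threading.Lock` standing in for an atomic test-and-set flag and all names (`SpinLock`, `run_workers`) being hypothetical.

```python
import threading


class SpinLock:
    """Toy spinlock: waiters poll the flag instead of going to sleep."""

    def __init__(self) -> None:
        # Used purely as an atomic test-and-set bit, never blocked on.
        self._flag = threading.Lock()

    def acquire(self) -> int:
        """Spin until the lock is taken; return the number of failed tries."""
        failed = 0
        while not self._flag.acquire(blocking=False):
            # Busy-wait. If the holder has been preempted, every pass
            # through this loop is wasted CPU time.
            failed += 1
        return failed

    def release(self) -> None:
        self._flag.release()


def run_workers(n_threads: int, iterations: int) -> int:
    """Increment a shared counter under the spinlock from several threads."""
    lock = SpinLock()
    counter = 0

    def worker() -> None:
        nonlocal counter
        for _ in range(iterations):
            lock.acquire()
            counter += 1  # the (very short) critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

The lock keeps the counter correct, but nothing stops the holder from being descheduled between `acquire()` and `release()`; when that happens, the other threads simply spin until it runs again.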
At first glance, that seemed to be exactly what was happening here, leading scheduler developer Peter Zijlstra to suggest that the proper fix was for PostgreSQL to use time-slice extension to protect lock holders from preemption. This feature allows a process to request that it not be preempted for a short period while it completes the execution of a critical section and releases its locks. It is a useful feature for a situation like this but, as PostgreSQL developer Andres Freund pointed out, time-slice extension was only added in the 7.0 kernel; "requiring the use of a new low level facility that was introduced in the 7.0 kernel, to address a regression that exists only in 7.0+, seems not great". It would also not be a simple change, he said, so backporting any such fix to released versions of PostgreSQL was unlikely to happen.
Zijlstra, faced with the prospect of having to revert a scheduler change that had been years in the making, was clearly reluctant to do so. He suggested that anybody who updates the kernel on a system running PostgreSQL could be expected to update the database manager as well. This is the sort of forced update scenario that the kernel's regression policy is meant to avoid, but Zijlstra remarked "sometimes you have to break eggs to make cake :-)". If a revert was needed, he said, it would be "a very temporary thing". The plan is to eventually remove PREEMPT_NONE entirely, eliminating a fair amount of complexity in the scheduler.
Meanwhile, Freund was unable to reproduce the problem, and had a hard time understanding how it could come about. A little while later, though, he figured it out. In his test systems, he had enabled the use of transparent huge pages (THPs), "as that is the only sane thing to do with 10s to 100s of GB of shared memory and thus part of all my benchmarking infrastructure". When he disabled huge pages, the problem reported by Dipietro surfaced immediately. That revelation removed the urgency from this regression:
"I don't see a reason to particularly care about the regression if that's the sole way to trigger it. Using a buffer pool of ~100GB without huge pages is not an interesting workload. With a smaller buffer pool the problem would not happen either."
He added that, even in the absence of the spinlock contention, avoiding huge pages was going to have bad performance effects.
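Some quick arithmetic shows the scale involved. The sketch below counts the pages needed to map a buffer pool of the size Freund mentioned; the 4KiB base-page and 2MiB huge-page sizes are assumptions (typical for x86-64, but not stated in the discussion), "~100GB" is taken as exactly 100GiB, and the helper name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages required to cover a region (rounded up)."""
    if region_bytes < 0 or page_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-region_bytes // page_bytes)  # ceiling division


pool = 100 * GIB                             # assumed size of the buffer pool
small_pages = pages_needed(pool, 4 * KIB)    # assumed base page size
huge_pages = pages_needed(pool, 2 * MIB)     # assumed huge page size

print(small_pages, huge_pages, small_pages // huge_pages)
```

Under those assumptions the pool needs 26,214,400 small pages but only 51,200 huge pages, a factor of 512. Each page is a separate translation to cache and a separate mapping to fault in.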
Freund had expressed confusion about how there could be contention on the lock that Dipietro pointed out, since the critical section it protects is quite short. But, when huge pages are not in use, that section will take longer to execute. The extra pressure on the translation lookaside buffer (TLB) caused by using small pages will be a part of the problem, but a bigger part is almost certainly just the greatly increased number of page faults that will occur in that configuration. These effects will increase the execution time in the critical section, increasing the chances that a PREEMPT_LAZY kernel will take control away from a lock-holding process. That slowdown is far less likely to happen when huge pages are in use.
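The link between critical-section length and preemption risk can be approximated with a back-of-the-envelope model. It assumes that a reschedule request is pending, that preemption is acted on at the scheduler tick, and that the critical section starts at a uniformly random point within the tick period; none of those assumptions come from the discussion itself, and the function name and numbers are hypothetical.

```python
def tick_overlap_probability(critical_section_us: float,
                             tick_period_us: float) -> float:
    """Chance that a scheduler tick lands inside a critical section.

    Assumes the section begins at a uniformly random phase of the tick
    period, so the probability is simply length / period, capped at 1.
    """
    if critical_section_us < 0 or tick_period_us <= 0:
        raise ValueError("durations must be non-negative and period positive")
    return min(1.0, critical_section_us / tick_period_us)


TICK_US = 1000.0  # assumed 1ms tick (HZ=1000); real configurations vary

for length_us in (1.0, 50.0):
    p = tick_overlap_probability(length_us, TICK_US)
    print(f"{length_us:>5.0f}us critical section -> {p:.3f}")
```

In this model the risk grows linearly with the time spent holding the lock, so a section that is stretched from 1µs to 50µs by page faults is fifty times as likely to be caught by a tick.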
One conclusion from that diagnosis is that time-slice extension would be of little help; Freund confirmed that the performance regression happened even when user-space spinlocks are taken out of the picture. That said, he did acknowledge that the feature was worth looking into on its own merits, saying it looks "nice for performance regardless of using spinlocks".
Dipietro confirmed that enabling huge pages caused the regression to disappear. With that report, thoughts of reverting the scheduler change also seemed to disappear. That may be a bit premature, though. There are likely to be systems in the wild running under less-than-optimal configurations that will show regressions when hit with this kind of change. That prospect, in turn, may cause distributors to shy away from lazy preemption in their kernels, regardless of what the scheduler developers might like. An immediate revert might not be in the cards, but the grand plan to remove PREEMPT_NONE may have a longer path to completion than some would like.
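Administrators wondering whether a given system is one of those less-than-optimal configurations can check the transparent-huge-page policy through sysfs. The sketch below reads the kernel's `/sys/kernel/mm/transparent_hugepage/enabled` file, where the active choice is shown in square brackets; the helper names are hypothetical, and this says nothing about whether a particular PostgreSQL installation actually ends up with huge pages behind its shared memory.

```python
from pathlib import Path
from typing import Optional

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> str:
    """Return the bracketed choice, e.g. 'madvise' from 'always [madvise] never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no bracketed choice in {text!r}")


def current_thp_mode(path: Path = THP_ENABLED) -> Optional[str]:
    """Return the active THP policy, or None if it cannot be determined."""
    try:
        return parse_thp_mode(path.read_text())
    except (OSError, ValueError):
        # File missing (non-Linux, or THP not built in) or unexpected format.
        return None


if __name__ == "__main__":
    print("THP policy:", current_thp_mode())
```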
