In mid-September, the 3.6 kernel appeared to be stabilizing nicely. Most
of the known regressions had been fixed, the patch volume was dropping, and
Linus was relatively happy. Then Nikolay Ulyanitsky showed up with
: the pgbench
benchmark ran 20% slower than under 3.5. The resulting discussion shows
just how hard scalability can be on contemporary hardware and how hard
scheduling can be in general.
Borislav Petkov was able to reproduce the problem; a dozen or so bisection
iterations later he narrowed down the problem to this
patch, which was duly reverted. There is just one little problem left: the
offending patch was, itself, meant to improve scheduler performance.
Reverting it fixed the PostgreSQL regression, but at the cost of losing an
optimization that improves things for many (arguably most) other
workloads. Naturally, that led to a search to figure out what the real
problem was so that the optimization could be restored without harmful
effects on PostgreSQL.
What went wrong
The kernel's scheduling domains mechanism
exists to optimize scheduling decisions by modeling the costs of moving
processes between CPUs. Migrating a process from one CPU to a
hyperthreaded sibling is nearly free; cache is shared at all levels, so the
moved process will not have to spend time repopulating cache with its
working set. Moving to another CPU within the same physical package will
cost more, but mid-level caches are still shared, so such a move is still
much less expensive than moving to another package entirely. The current
scheduling code thus tries to keep processes within the same package
whenever possible, but it also tries to spread runnable processes across the
package's CPUs to maximize throughput.
The problem that the offending patch (by Mike Galbraith) was trying to
solve comes from the fact that the number of CPUs built into a single
package has been growing over time. Not too long ago, examining every
processor within a package in search of an idle CPU for a runnable process
was a relatively quick affair. As the number of CPUs in a package
increases, the cost of that search increases as well, to the point that it
starts to look expensive. The current scheduler's behavior, Mike said at
the time, could also result in processes bouncing around the package
excessively. The result was less-than-optimal performance.
Mike's solution was to organize CPUs into pairs; each CPU gets one "buddy"
CPU. When one CPU wakes a process and needs to find a processor for that
process to run on, it examines only the buddy CPU. The process will be
placed on either the original CPU or the buddy; the search will go no
further than that even if there might be a more lightly loaded CPU
elsewhere in the package. The cost of iterating
over the entire package is eliminated, process bouncing is reduced, and
things run faster. Meanwhile, the scheduler's load balancing code can
still be relied upon to distribute the load across the available CPUs in
the longer term. Mike reported significant improvements in tbench
benchmark results with the patch, and it was quickly accepted for the 3.6
So what is different about PostgreSQL that caused it to slow down in
response to this change? It seems to come down to the design of the
PostgreSQL server and the fact that it does a certain amount of its own
scheduling with user-space spinlocks. Carrying its own spinlock
implementation does evidently yield performance benefits for the PostgreSQL
project, but it also makes the system more susceptible to problems
resulting from scheduler changes in the underlying system.
In this case, restricting the set of CPUs on which a newly-woken process
can run increases the chance that it will end up preempting another
PostgreSQL process. If the new process needs a lock held by the preempted
process, it will end up waiting until the preempted processes manages to
run again, slowing things down. Possibly even worse is that preempting the
process — also more likely with Mike's patch — can slow the flow of tasks
to all PostgreSQL worker processes; that, too, will hurt performance.
Making things better
What is needed is a way to gain the benefits of Mike's patch without making
things worse for PostgreSQL-style loads. One possibility, suggested by Linus, is to try to reduce the
cost of searching for an idle CPU instead of eliminating the search
outright. It appears that there is some low-hanging fruit in this area,
but it is not at all clear that optimizing the search, by itself, will
solve the entire problem. Mike's patch eliminates that search cost, but it
also reduces movement of processes around the package; a fix that only
addresses the first part risks falling short in the end.
Another possibility is to simply increase the scheduling granularity,
essentially giving longer time slices to running processes. That will
reduce the number of preemptions, making it less likely that PostgreSQL
processes will step on each other's toes.
granularity does, indeed, make things better for the pgbench
load. There may be some benefit to be had from messing with the
granularity, but it is not without its risks. In particular, increasing
the granularity could have an adverse effect on desktop interactivity;
there is no shortage of Linux users who would consider that to be a bad
Yet another possibility is to somehow teach the scheduler to recognize
processes — like the PostgreSQL dispatcher — that should not be preempted
by related processes if it can be avoided. Ingo Molnar suggested investigating this idea:
Add a kernel solution to somehow identify 'central' processes and
bias them. Xorg is a similar kind of process, so it would help
other workloads as well. That way lie dragons, but might be worth
an attempt or two.
The problem, of course, is the dragons. The O(1) scheduler, used by Linux
until the Completely Fair Scheduler (CFS) was merged for 2.6.23, had, over
time, accumulated no end of heuristics and hacks designed to provide the
"right" kind of scheduling for various types of workloads. All these
tweaks complicated the scheduler code considerably, making it fragile and
difficult to work with — and they didn't even work much of the time. This
complexity inspired Con Kolivas's "staircase deadline scheduler" as a much
simpler solution to the problem; that work led to the writing (and merging)
Naturally, CFS has lost a fair amount of its simplicity since it was merged;
contact with the real world tends to do that to scheduling algorithms. But
it is still relatively free of workload-specific heuristics. Opening the
door to more of them now risks driving the scheduler in a less
maintainable, more brittle direction where nothing can be done without
a significant chance of creating problems in unpredictable places. It
seems unlikely that the development community wants to go there.
A potentially simpler alternative is to let the application itself tell the
scheduler that one of its processes is special. PostgreSQL could request
that its dispatcher be allowed to run at the expense of one of its own
workers, even if the normal scheduling algorithm would dictate otherwise.
That approach reduces complexity, but it does so by pushing some of the
cost into applications. Getting application developers to accept that cost
can be a challenge, especially if they are interested in supporting
operating systems other than Linux. As a general rule, it is far better if
things just work without the need for manual intervention of this type.
So, in other words, nobody really knows how this problem will be solved at
this time. There are several interesting ideas to pursue, but none that
seem like an obvious solution. Further research is clearly called for.
One good point in all of this is that the problem was found before the
final 3.6 kernel shipped. Performance regressions have a way of hiding,
sometimes for years, before they emerge to bite some important workload.
Eventually, tools like Linsched may help to
find more of these problems early, but we will always be dependent on users
who will perform this kind of testing with workloads that matter to them.
Without Nikolay's 3.6-rc testing, PostgreSQL users might have had an
unpleasant surprise when this kernel was released.
to post comments)