
The 7.0 scheduler regression that wasn't

By Jonathan Corbet
April 17, 2026
One of the more significant changes in the 7.0 kernel release is to use the lazy-preemption mode by default in the CPU scheduler. The scheduler developers have wanted to reduce the number of preemption modes for years, and lazy preemption looks like a step toward that goal. But then came a report from Salvatore Dipietro that lazy preemption caused a 50% performance regression on a PostgreSQL benchmark. Investigation showed that the situation is not actually so grave, but the episode highlights just how sensitive some workloads can be to configuration changes; there may be surprises in store for other users as well.

One of the key decisions a CPU scheduler must make is when to remove a running process from the CPU to allow another to run. Preempting processes quickly when there is higher-priority work to do can produce quicker response times and, thus, lower latency. Aggressive preemption comes with a cost, though, in terms of the overall throughput of the system. Rapid switching of tasks can lead to more scheduler overhead, worse cache utilization, and more lock contention. It is hard to find a solution that works for every workload, a fact that has made it hard to remove the variety of preemption modes from the scheduler.

The lazy-preemption mode was designed with an eye toward the needs of both latency-sensitive and throughput-driven workloads. Unlike the full-preemption or realtime modes, lazy preemption will normally allow a task to run for a while even after the need for preemption has been detected. That preemption will be deferred until the task exhausts its time slice, the task blocks for some other reason, or the next scheduler tick occurs. That leads to a quicker preemption than would happen with the PREEMPT_NONE mode (which only preempts a process at the end of its time slice), but still allows the task to run for a while before the preemption occurs.
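
As a rough illustration of the timing described above, the following sketch models when a pending preemption would take effect under each mode. It is a toy model that follows only the description in this article, not the scheduler's actual logic; the function name, the mode strings, and the microsecond values are all hypothetical.

```python
def preemption_time(mode, request_at, slice_end, blocks_at, tick_period):
    """Return the time (in microseconds) at which the running task loses the CPU.

    Toy model only, following the article's description:
      full: preempt as soon as the need is detected
      lazy: wait for the next tick, the end of the slice, or a block
      none: wait for the end of the slice or a block
    All parameters are hypothetical illustration values.
    """
    if mode == "full":
        return request_at

    # Events that end the task's run in both deferred modes.
    candidates = [slice_end, blocks_at]

    if mode == "lazy":
        # The first tick boundary strictly after the request.
        next_tick = (request_at // tick_period + 1) * tick_period
        candidates.append(next_tick)
    elif mode != "none":
        raise ValueError(f"unknown mode: {mode!r}")

    # Preemption cannot happen before it was requested.
    return max(request_at, min(candidates))


if __name__ == "__main__":
    for mode in ("full", "lazy", "none"):
        print(mode, preemption_time(mode, 1500, 3000, 10000, 1000))
```

With a request at 1500µs, a tick every 1000µs, and a slice ending at 3000µs, the lazy case lands between the other two, which is the compromise the mode is aiming for.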

Dipietro reported that the PostgreSQL performance regression was caused by a large increase in lock contention. PostgreSQL uses user-space spinlocks for much of its concurrency control; one problem with such locks is that, if a lock holder is preempted before the lock can be released, other processes will spin on a lock that may remain held for a long time. An increase in the frequency of preemption could indeed cause this to happen; more preemptions mean more chances to sideline a process before it is able to release a contended lock.

At first glance, that seemed to be exactly what was happening here, leading scheduler developer Peter Zijlstra to suggest that the proper fix was for PostgreSQL to use time-slice extension to protect lock holders from preemption. This feature allows a process to request that it not be preempted for a short period while it completes the execution of a critical section and releases its locks. It is a useful feature for a situation like this but, as PostgreSQL developer Andres Freund pointed out, time-slice extension was only added in the 7.0 kernel; "requiring the use of a new low level facility that was introduced in the 7.0 kernel, to address a regression that exists only in 7.0+, seems not great". It would also not be a simple change, he said, so backporting any such fix to released versions of PostgreSQL was unlikely to happen.

Zijlstra, faced with the prospect of having to revert a scheduler change that had been years in the making, was clearly reluctant to do so. He suggested that anybody who updates the kernel on a system running PostgreSQL could be expected to update the database manager as well. This is the sort of forced update scenario that the kernel's regression policy is meant to avoid, but Zijlstra remarked "sometimes you have to break eggs to make cake :-)". If a revert was needed, he said, it would be "a very temporary thing". The plan is to eventually remove PREEMPT_NONE entirely, eliminating a fair amount of complexity in the scheduler.

Meanwhile, Freund was unable to reproduce the problem and had a hard time understanding how it could come about. A little while later, though, he figured it out. In his test systems, he had enabled the use of transparent huge pages (THPs), "as that is the only sane thing to do with 10s to 100s of GB of shared memory and thus part of all my benchmarking infrastructure". When he disabled huge pages, the problem reported by Dipietro surfaced immediately. That revelation removed the urgency from this regression:

I don't see a reason to particularly care about the regression if that's the sole way to trigger it. Using a buffer pool of ~100GB without huge pages is not an interesting workload. With a smaller buffer pool the problem would not happen either.

He added that, even in the absence of the spinlock contention, avoiding huge pages was going to have bad performance effects.

Freund had expressed confusion about how there could be contention on the lock that Dipietro pointed out, since the critical section it protects is quite short. But, when huge pages are not in use, that section will take longer to execute. The extra pressure on the translation lookaside buffer (TLB) caused by using small pages will be a part of the problem, but a bigger part is almost certainly just the greatly increased number of page faults that will occur in that configuration. These effects will increase the execution time in the critical section, increasing the chances that a PREEMPT_LAZY kernel will take control away from a lock-holding process. That slowdown is far less likely to happen when huge pages are in use.
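
To put rough numbers on that difference, the sketch below counts how many pages are needed to back a buffer pool at different page sizes. It assumes a pool of exactly 100 GiB (the discussion only says "~100GB") and the page sizes mentioned in the comments (4 KB, 2 MB, 1 GB); each page can cost a fault the first time it is touched, so the page count is an upper bound on first-touch faults rather than a measurement.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(region_bytes, page_bytes):
    """Number of pages of size page_bytes needed to cover region_bytes."""
    # Ceiling division, so a partial trailing page still counts.
    return -(-region_bytes // page_bytes)


if __name__ == "__main__":
    pool = 100 * GIB  # assumed size; the source says "~100GB"
    for label, size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
        print(f"{label} pages: {pages_needed(pool, size):,}")
```

Under these assumptions, small pages mean 512 times as many pages to fault in as 2 MiB pages do.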

One conclusion from that diagnosis is that time-slice extension would be of little help; Freund confirmed that the performance regression happened even when user-space spinlocks are taken out of the picture. That said, he did acknowledge that the feature was worth looking into on its own merits, saying it looks "nice for performance regardless of using spinlocks".

Dipietro confirmed that enabling huge pages caused the regression to disappear. With that report, thoughts of reverting the scheduler change also seemed to disappear. That may be a bit premature, though. There are likely to be systems in the wild running under less-than-optimal configurations that will show regressions when hit with this kind of change. That prospect, in turn, may cause distributors to shy away from lazy preemption in their kernels, regardless of what the scheduler developers might like. An immediate revert might not be in the cards, but the grand plan to remove PREEMPT_NONE may have a longer path to completion than some would like.




transparent hugepages preventing performance regressions instead of causing them

Posted Apr 17, 2026 14:45 UTC (Fri) by vbabka (subscriber, #91706) [Link] (8 responses)

sure wasn't on my 2026 bingo card!

transparent hugepages preventing performance regressions instead of causing them

Posted Apr 18, 2026 2:01 UTC (Sat) by page_walker (subscriber, #99801) [Link] (7 responses)

I needed to read it multiple times to confirm THP is now a good thing. :)

transparent hugepages preventing performance regressions instead of causing them

Posted Apr 18, 2026 5:48 UTC (Sat) by andresfreund (subscriber, #69562) [Link] (6 responses)

It doesn't have an effect in this case; my comments / benchmark results were about using explicit huge pages. The regression reproduces when THPs are enabled, but explicit huge pages are not used. The entire contention is due to page faults during the first access to a page. In a workload with substantial memory pressure, I don't think it's likely that THP will succeed often / quickly enough to matter.

transparent hugepages preventing performance regressions instead of causing them

Posted Apr 18, 2026 13:51 UTC (Sat) by willy (subscriber, #9762) [Link]

khugepaged is unlikely to be of help, but we can allocate THPs by default these days given appropriate calls to madvise() and similar.

It will fall back to lower orders, probably more easily than you'd prefer.

transparent hugepages preventing performance regressions instead of causing them

Posted Apr 20, 2026 1:35 UTC (Mon) by mmeehan (subscriber, #72524) [Link] (4 responses)

Thinking about transparent huge page allocation, there are cases where the allocation page size could flip to 2 MB by default. For example, malloc calls over some size could be transparently backed by huge pages straight away. Would that have helped for this workload?

transparent hugepages preventing performance regressions instead of causing them

Posted Apr 20, 2026 2:03 UTC (Mon) by andresfreund (subscriber, #69562) [Link] (3 responses)

> Would that have helped for this workload?

Possibly. But it's worth noting that that workload is ... not realistic, as it tests an absurd amount of concurrency, with a cold cache, a huge amount of memory. This is not a bottleneck that I've seen outside of intentional torture testing.

I am somewhat sceptical that relying on THPs for a 100GB database buffer pool is a good idea or will be one in the near future.

transparent hugepages preventing performance regressions instead of causing them

Posted Apr 20, 2026 3:14 UTC (Mon) by mmeehan (subscriber, #72524) [Link] (2 responses)

It's more like, if you malloc 10 GB in one go then the kernel may as well give you 1 GB or 2 MB pages transparently if the buddy allocator has contiguous blocks of memory of that size around. If not, you fall back to 4 KB pages. Doesn't seem like it would be more overhead to allocate them this way outright, rather than satisfy the malloc with 4 KB pages and try to mash them together later. This would lead to most big allocations being huge page backed outright, until fragmentation kicked in.

transparent hugepages preventing performance regressions instead of causing them

Posted Apr 20, 2026 11:38 UTC (Mon) by daroc (editor, #160859) [Link] (1 responses)

I think that idea may not play well with lazy memory allocation — the kernel doesn't actually allocate memory for a program's use until it touches a page, which is an important optimization because plenty of programs ask for more memory than they actually use.

transparent hugepages preventing performance regressions instead of causing them

Posted Apr 20, 2026 12:20 UTC (Mon) by vbabka (subscriber, #91706) [Link]

Indeed. But even without such preallocation, the first access in every 2MB aligned sub-range of that large range would attempt to give you a 2MB THP, not a 4kB base page. That's what "always" means here and it's AFAIK default:

# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

If that 2MB page isn't readily available (memory is low, fragmented), then it depends if the system will try reclaim/compaction or not, depending on this:

# cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise [madvise] never

And AFAIK the default is madvise, so only regions subjected to madvise(MADV_HUGEPAGE) will try reclaim/compaction.
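
For scripting the checks shown above, a minimal sketch of reading those two files and extracting the bracketed (active) choice might look like this. The helper names are hypothetical; the sysfs paths are the ones quoted in the comment, and the code simply returns None on systems where they do not exist.

```python
import re
from pathlib import Path

# Directory quoted in the comment above.
THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def active_setting(text):
    """Return the choice wrapped in [brackets], or None if there is none."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def read_thp_setting(name):
    """Read one THP sysfs file (such as 'enabled' or 'defrag').

    Returns None when the file is missing or unreadable, for example on a
    kernel built without THP support or on a non-Linux system.
    """
    try:
        return active_setting((THP_DIR / name).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    for name in ("enabled", "defrag"):
        print(name, read_thp_setting(name))
```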

why did PREEMPT_LAZY cause more preemptions than PREEMPT_NONE with THP disabled?

Posted Apr 18, 2026 2:22 UTC (Sat) by rharjani (subscriber, #87278) [Link]

I guess from the above it still wasn't fully clear why PREEMPT_LAZY caused more preemptions than PREEMPT_NONE with THP disabled (or maybe it was obvious to a few).

My assumption there is -

With THP disabled, the workload takes many more page faults, which serialize on the page table spinlock. This means more preempt_disable() / preempt_enable() pairs for this type of workload. Since preempt_enable() is a preemption point under PREEMPT_LAZY but a no-op under PREEMPT_NONE, maybe this gives far more preemption opportunities under PREEMPT_LAZY (especially since the workload was CPU-intensive and ran well past the scheduler tick that would have escalated TIF_NEED_RESCHED_LAZY to TIF_NEED_RESCHED).

not THP, explicit huge pages

Posted Apr 18, 2026 5:43 UTC (Sat) by andresfreund (subscriber, #69562) [Link] (2 responses)

Note that the fix isn't to use THP, it's using explicit huge pages. I think the workload in question would be unlikely to be helped by THP, as the entire problem only happens during warmup, when a page in postgres' buffer pool is used for the first time.

> PostgreSQL uses user-space spinlocks for much of its concurrency control

Fwiw, that's not really true. Almost everything is protected by locks that use futexes under the hood; spinlocks are only used for much less commonly contended state, or for state that is extremely crucial (exactly one spinlock), where futexes are just not competitive due to the additional memory barrier during lock release.

not THP, explicit huge pages

Posted Apr 19, 2026 6:48 UTC (Sun) by mokki (subscriber, #33200) [Link] (1 responses)

PostgreSQL could add an option to pre touch all shared memory on startup (in parallel). After that the differences between THP and real huge pages are at least somewhat reduced.

This is what Java's -XX:+UseTransparentHugePages and -XX:+AlwaysPreTouch combination does. In cases where one cannot get real huge pages (inside many containers, or without root access), it is very convenient.
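
A minimal sketch of the pre-touch idea follows: map an anonymous region, hint that huge pages are wanted where the platform supports it, then write to one byte per page so that all of the page faults happen up front. This is an illustration only, not how PostgreSQL or the JVM do it; the function names are hypothetical, the touching is serial rather than parallel, and the madvise() hint is best-effort and may be ignored or rejected by the kernel.

```python
import mmap


def pretouch(region):
    """Write one byte to every page of a fresh mapping so it is faulted in now.

    Returns the number of pages touched. Only suitable for a freshly mapped,
    zero-filled region, since it stores a zero at the start of each page.
    """
    touched = 0
    for offset in range(0, len(region), mmap.PAGESIZE):
        region[offset] = 0
        touched += 1
    return touched


def allocate_pretouched(size):
    """Map `size` bytes of anonymous memory and fault all of it in."""
    region = mmap.mmap(-1, size)
    # MADV_HUGEPAGE is only exposed on Linux (Python 3.8+); treat it as a hint.
    if hasattr(region, "madvise") and hasattr(mmap, "MADV_HUGEPAGE"):
        try:
            region.madvise(mmap.MADV_HUGEPAGE)
        except OSError:
            pass  # kernel without THP support, or hint not accepted
    pretouch(region)
    return region


if __name__ == "__main__":
    buf = allocate_pretouched(64 * mmap.PAGESIZE)
    print(len(buf) // mmap.PAGESIZE, "pages mapped and touched")
    buf.close()
```

Touching at the base page size is deliberate: it works whether or not the kernel ended up backing the region with huge pages.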

not THP, explicit huge pages

Posted Apr 27, 2026 15:45 UTC (Mon) by psoo (subscriber, #94343) [Link]

There's pg_prewarm in contrib, which helps warm up the shared_buffers pool.


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds