LWN: Comments on "How 3.6 nearly broke PostgreSQL" https://lwn.net/Articles/518329/ This is a special feed containing comments posted to the individual LWN article titled "How 3.6 nearly broke PostgreSQL". en-us Sat, 01 Nov 2025 03:59:29 +0000 Sat, 01 Nov 2025 03:59:29 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/519688/ https://lwn.net/Articles/519688/ rodgerd <div class="FormattedComment"> Not just the ARM world, really; consider the turbo mode available with newer Intel and AMD processors. If you're doing work that runs better with fewer, faster threads rather than more, slower ones, you'd want to aggregate the workload onto fewer cores and let the CPU ratchet the clock up.<br> </div> Sat, 13 Oct 2012 07:35:57 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/519526/ https://lwn.net/Articles/519526/ bjmaz <div class="FormattedComment"> Seems to me a solution to this problem could be to dedicate a core to scheduling. This, of course, would assume a many-core, homogeneous system. The scheduler could then keep track of free resources/running processes.<br> </div> Fri, 12 Oct 2012 01:27:38 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/519488/ https://lwn.net/Articles/519488/ bbulkow <div class="FormattedComment"> I'm somewhat pained by the "we can't make the desktop less responsive" argument.<br> <p> As a high performance server author (currently <a rel="nofollow" href="http://Aerospike.com/">http://Aerospike.com/</a> ), I am working hard to avoid context switch overhead in a few places. We are getting hurt by core scheduling issues - we often recommend turning off hyperthreading.<br> <p> I have tried signalling to the OS the heuristics I think make my server go faster.
I've used sched_setscheduler(), which seemed to be effective before CFS but makes no difference in "modern" kernels.<br> <p> As Linux grows more beloved as a server OS, what's so bad about having known tuning parameters / API calls to change the scheduling heuristics? Is sched_setscheduler() to be avoided?<br> </div> Thu, 11 Oct 2012 21:29:13 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518788/ https://lwn.net/Articles/518788/ eternaleye <div class="FormattedComment"> Alright, here's a version that has a Kautz::Graph class, and prints both nodes and edges when run: <a rel="nofollow" href="http://ix.io/36k/">http://ix.io/36k/</a><br> </div> Sat, 06 Oct 2012 08:45:06 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518786/ https://lwn.net/Articles/518786/ eternaleye <div class="FormattedComment"> ...And I just realized I misread my own code. It actually prints the names of the nodes; it's just that the symbols of the name are space-separated, since they can be any positive integer &lt;= degree.<br> <p> Well, one more thing to remedy.<br> </div> Sat, 06 Oct 2012 08:26:35 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518785/ https://lwn.net/Articles/518785/ eternaleye <div class="FormattedComment"> Another option might be to use a Kautz digraph - it seems like in terms of a migration problem like this, it might be very nearly optimal.<br> <p> I've found this to be a far clearer explanation than the Wikipedia article: <a rel="nofollow" href="http://pl.atyp.us/wordpress/index.php/2007/12/the-kautz-graph/">http://pl.atyp.us/wordpress/index.php/2007/12/the-kautz-g...</a><br> <p> In learning about them myself, I wrote a perl script to generate a Kautz graph: <a rel="nofollow" href="http://ix.io/36j/">http://ix.io/36j/</a><br> <p> It takes the degree as the first argument and the dimension as the second, and outputs a list of edges as &lt;from&gt; &lt;to&gt; tuples, one per line.<br> <p> I still need to write the Kautz::Graph
class (stubbed in the file above) to embody a full set of the Kautz nodes with the same degree and dimension parameters, and see about modifying it to generate dot so it can make a nice graphic.<br> </div> Sat, 06 Oct 2012 08:13:48 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518745/ https://lwn.net/Articles/518745/ jhoblitt <div class="FormattedComment"> Couldn't libpthreads notify the kernel that pthread_spin_lock() is in use? You certainly wouldn't want to introduce a system call on every use, but having to branch on the first use by a given pid shouldn't be too expensive.<br> </div> Fri, 05 Oct 2012 20:45:40 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518673/ https://lwn.net/Articles/518673/ amit.kucheria <div class="FormattedComment"> Except when you're trying to optimize for power, which we're trying to do in the ARM world. :)<br> <p> Consolidating work onto fewer cores[1] is something we're actively looking into. <br> <p> [1] <a href="https://blueprints.launchpad.net/linaro-power-kernel/+spec/power-aware-scheduler">https://blueprints.launchpad.net/linaro-power-kernel/+spe...</a><br> </div> Fri, 05 Oct 2012 10:28:30 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518533/ https://lwn.net/Articles/518533/ dunlapg <div class="FormattedComment"> Yeah, it seems like if the problem is that there are "too many" cores per socket, then instead of fixing on 2, it should be fixed on "almost but not quite too many" -- e.g., 4. So if you have 4 cores per socket, it looks exactly like it does before; but if you have 8, you still only look at 4 other cores. Having the sets overlap seems like an obvious way to make sure things can find a good optimum within a few iterations.<br> </div> Thu, 04 Oct 2012 10:46:22 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518461/ https://lwn.net/Articles/518461/ martinfick <div class="FormattedComment"> Agreed.
My limited opinion is simply that spinning is a sometimes necessary evil; avoid it whenever possible; use a kernel-supported mechanism if you expect to get consistent kernel behavior.<br> </div> Wed, 03 Oct 2012 19:09:27 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518459/ https://lwn.net/Articles/518459/ corbet FWIW, the discussion in kernelland was based on the assumption that this regression was the kernel's problem. Nobody there suggested telling the PostgreSQL developers to come up with a new locking scheme. Wed, 03 Oct 2012 19:00:23 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518456/ https://lwn.net/Articles/518456/ jberkus <div class="FormattedComment"> <font class="QuotedText">&gt; My thought when I read about this problem is that PostgreSQL is causing it with their user-space locking and so PostgreSQL needs to fix it. </font><br> <p> So there are a couple of major problems with the idea that this issue should be fixed in the PostgreSQL code:<br> <p> 1. PostgreSQL is not the only application with its own spinlock implementation. There is, for example, Oracle.<br> <p> 2. While we might be able to change locking in future versions of PostgreSQL, we can't change locking in past ones.<br> <p> Even if the next version of PostgreSQL (9.3) has a modified locking implementation which doesn't hit the issues in 3.6, the number of people running older versions of PostgreSQL will far outnumber the folks running the latest version for quite some time. What you'd be saying to all of those folks is, effectively, "don't upgrade Linux".<br> <p> So any solution which hinges on "make these modifications to PostgreSQL" will instead result in PostgreSQL users deciding to stay on old versions of the Kernel. If this problem is equally bad for Oracle, you might even see RedHat refusing to deploy a version based on 3.6.<br> <p> There's also the fact that PostgreSQL's locking implementation is complex and quite highly tuned.
So the idea that the PostgreSQL project could make changes which wouldn't result in a worse regression in a few months is optimistic. Implementing things like futex support could take literally years.<br> </div> Wed, 03 Oct 2012 18:56:45 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518449/ https://lwn.net/Articles/518449/ cmccabe <div class="FormattedComment"> It seems like anyone who uses pthread_spin_lock or a similar thing could run into this problem. Whenever one thread waits on another, without telling the kernel, the kernel might make a scheduling decision that ends up being suboptimal (like putting them both on the same core.)<br> </div> Wed, 03 Oct 2012 18:22:37 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518408/ https://lwn.net/Articles/518408/ and <div class="FormattedComment"> I understood that approach differently: If core A has B as its buddy, B's buddy does not necessarily need to be A. This would imply kind of a ring buffer of buddies within the package (A-&gt;B-&gt;C-&gt;D-&gt;A)...<br> </div> Wed, 03 Oct 2012 13:31:13 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518405/ https://lwn.net/Articles/518405/ andresfreund <div class="FormattedComment"> <font class="QuotedText">&gt; Possibly even worse is that preempting the PostgreSQL dispatcher process — also more likely with Mike's patch — can slow the flow of tasks to all PostgreSQL worker processes; that, too, will hurt performance.</font><br> The thing you could describe as a dispatcher, the "postmaster" process, which forks to create individual backends after receiving a connection request, doesn't do any relevant locking. So preempting it should be mostly harmless.<br> There are other processes though which might have something like the effect described above. Namely the 'wal writer', 'checkpointer' (9.2+, included in bgwriter before) and bgwriter processes.
Especially if you have a write-intensive concurrent workload, interactions between individual connection backends and the wal writer can get contended. So it's somewhat likely that preemption might get painful there.<br> <p> Postgres is *far* from the only application doing its own spinlock implementation and contending on the locks every now and then. I am pretty sure that if the patch had made its way into 3.6, more performance regression reports would have come in.<br> </div> Wed, 03 Oct 2012 12:40:02 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518406/ https://lwn.net/Articles/518406/ rvfh <div class="FormattedComment"> Unless I misread the code, it's more or less what happens now: if a core becomes idle and no load balancer exists in the system, it will start assuming that role and try to get some work to do.<br> </div> Wed, 03 Oct 2012 12:36:21 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518403/ https://lwn.net/Articles/518403/ ajb <div class="FormattedComment"> Your example has convinced me that I was wrong to say that you could get away with only two buddies. If you have only two, the best you can do is have a ring of cpus. This is different from the case of cuckoo hashing and skew caches, where only two places are sufficient.<br> <p> However, in your example, A and C have 3 buddies, but B and D have only two. It seems like the scheduling algorithm would be just as efficient if they all had three, and there would be a better chance that migrations would be beneficial.
<br> </div> Wed, 03 Oct 2012 12:03:39 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518393/ https://lwn.net/Articles/518393/ Zenith <div class="FormattedComment"> I am not familiar with skew caches or cuckoo hashing, so I may just be restating your solution.<br> <p> Would a better solution than the one suggested by Mike not be to have 3 cores in each set, and then make the sets overlap?<br> <p> So say we have a 4-core CPU (labelled A, B, C, and D) that then gets divided into two sets X = (A, B, C) and Y = (A, C, D).<br> That way the load can migrate from one set (X) of CPUs into another set (Y), provided that the load migration code gets lucky on selecting set Y.<br> </div> Wed, 03 Oct 2012 08:07:24 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518392/ https://lwn.net/Articles/518392/ ajb <div class="FormattedComment"> Having only one 'buddy' core divides the set of cores into isolated pairs; checking two would allow processes to move anywhere within the set, as in a skew cache or cuckoo hashing.
Maybe that would work better.<br> </div> Wed, 03 Oct 2012 07:31:45 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518389/ https://lwn.net/Articles/518389/ fdr <div class="FormattedComment"> I think someone wrote up a prototype (actually, I think that may be several across many years, this being the latest incarnation I know of):<br> <p> <a href="http://archives.postgresql.org/pgsql-hackers/2012-06/msg01588.php">http://archives.postgresql.org/pgsql-hackers/2012-06/msg0...</a><br> <p> This actually uses futexes indirectly, in my understanding from the post, and it's not the most awful thing for Linux.<br> <p> It's possible that Linux futexes are not a bad idea (but it's also not clearly a huge improvement), but s_lock does okay and has existed an awfully long time (with iteration), so there's some inertia there.<br> <p> Also, I think I oversimplified the parent's post, he could have simply meant that something about s_lock is not very good, as opposed to "self-rolled user space spinlocks? That's ridiculous!" (as a straw man). It's possible s_lock could be improved, however, it was fine before and doesn't clearly seem at fault right now. I think the suggestion of having to manhandle the platform-specific scheduler also seems excessive. There may be a better solution to serve everyone; somehow I can't see PostgreSQL's spinlocks being one-of-a-kind in this pathology, but I haven't attempted to prove that.<br> </div> Wed, 03 Oct 2012 04:30:18 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518387/ https://lwn.net/Articles/518387/ josh <div class="FormattedComment"> How much faster is PostgreSQL's userspace locking than futexes? In theory, futexes have no kernel overhead when acquired uncontended, and minimal overhead (just the overhead of blocking in the scheduler) when contended. 
The only case I can think of that would have less overhead would involve busy-waiting in userspace.<br> </div> Wed, 03 Oct 2012 03:19:16 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518385/ https://lwn.net/Articles/518385/ fdr <div class="FormattedComment"> There is no user-space locking that is portable, bug-free, and fast enough to satisfy PostgreSQL on its supported platforms. So, as far as I know, this is pretty much off the table, and there are good reasons for that.<br> </div> Wed, 03 Oct 2012 02:34:52 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518384/ https://lwn.net/Articles/518384/ ncm <div class="FormattedComment"> It seems to me that a more lightly-loaded core should be put to work first recruiting work from its more heavily-loaded neighbors, offloading both scheduling activity and finally other processes from them. Is there something very cheap that a busy core could do that allows its neighbors to make better decisions about what work to take away from it?<br> </div> Wed, 03 Oct 2012 02:13:25 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518382/ https://lwn.net/Articles/518382/ zlynx <div class="FormattedComment"> I agree.<br> <p> My thought when I read about this problem is that PostgreSQL is causing it with their user-space locking and so PostgreSQL needs to fix it. Perhaps it should pin its processes to particular cores, ensuring that it spreads out across all the cores. It could also nice its worker processes while leaving the dispatcher at normal. Or it could negative-nice (would that be making the process mean?) its dispatcher.<br> </div> Wed, 03 Oct 2012 01:01:33 +0000 How 3.6 nearly broke PostgreSQL https://lwn.net/Articles/518378/ https://lwn.net/Articles/518378/ xorbe <div class="FormattedComment"> Linus has already noted the right solution: one can't just throw out possible cpu execution resources. We've been through this before. 
For the masses, a reliable and predictable scheduler is far better than one that makes 99% of things run a touch faster, but hoses the remaining 1% -- eventually it snags some major oft-used package. The other proposed ideas in the article seem crazy. Leave the fine scheduler tuning for shops that need to squeeze out the last few percent for a dedicated work load.<br> </div> Tue, 02 Oct 2012 23:43:41 +0000