
How 3.6 nearly broke PostgreSQL


Posted Oct 3, 2012 1:01 UTC (Wed) by zlynx (subscriber, #2285)
In reply to: How 3.6 nearly broke PostgreSQL by xorbe
Parent article: How 3.6 nearly broke PostgreSQL

I agree.

My thought when I read about this problem is that PostgreSQL is causing it with their user-space locking and so PostgreSQL needs to fix it. Perhaps it should pin its processes to particular cores, ensuring that it spreads out across all the cores. It could also nice its worker processes while leaving the dispatcher at normal. Or it could negative-nice (would that be making the process mean?) its dispatcher.
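For readers unfamiliar with the knobs mentioned above, here is a minimal C sketch of pinning and renicing, assuming Linux with glibc. The helper names `pin_self_to_cpu` and `renice_self` are made up for illustration; nothing here is taken from PostgreSQL, and whether any of this would actually help with the 3.6 regression is untested.

```c
#define _GNU_SOURCE             /* needed for cpu_set_t and sched_setaffinity */
#include <assert.h>
#include <sched.h>
#include <sys/resource.h>

/* Restrict the calling process to a single CPU.
 * Returns 0 on success, -1 on error (errno is set by the kernel). */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* pid 0 = caller */
}

/* Set the caller's nice value (-20 is most favoured, 19 is least).
 * Raising the value needs no privilege; lowering it ("negative nice")
 * normally requires CAP_SYS_NICE or a suitable RLIMIT_NICE. */
static int renice_self(int nice_value)
{
    return setpriority(PRIO_PROCESS, 0, nice_value);
}
```

Under this scheme a worker would call something like `pin_self_to_cpu(worker_index % ncpus)` and `renice_self(5)` right after it is forked, while the dispatcher stays at the default. Both `worker_index` and `ncpus` are hypothetical names here.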



How 3.6 nearly broke PostgreSQL

Posted Oct 3, 2012 2:34 UTC (Wed) by fdr (guest, #57064) [Link]

There is no alternative to user-space locking that is portable, bug-free, and fast enough to satisfy PostgreSQL on its supported platforms. So, as far as I know, replacing it is pretty much off the table, and there are good reasons for that.

How 3.6 nearly broke PostgreSQL

Posted Oct 3, 2012 3:19 UTC (Wed) by josh (subscriber, #17465) [Link]

How much faster is PostgreSQL's userspace locking than futexes? In theory, futexes have no kernel overhead when acquired uncontended, and minimal overhead (just the overhead of blocking in the scheduler) when contended. The only case I can think of that would have less overhead would involve busy-waiting in userspace.
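To make the "no kernel overhead when uncontended" point concrete, here is a minimal futex-based lock in C, assuming Linux and C11 atomics. It follows the well-known three-state design (0 = free, 1 = held, 2 = held with waiters); the names `futex_lock`, `futex_lock_acquire`, `futex_lock_release` and `run_demo` are invented for this sketch, and it is not what PostgreSQL or glibc actually ship.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* state: 0 = free, 1 = held, 2 = held and someone may be sleeping */
typedef struct { atomic_int state; } futex_lock;

/* Thin wrapper: glibc exposes no futex() function, so use syscall(). */
static long futex(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void futex_lock_acquire(futex_lock *l)
{
    int c = 0;

    /* Fast path: one compare-and-swap 0 -> 1, no system call at all. */
    if (atomic_compare_exchange_strong(&l->state, &c, 1))
        return;

    /* Slow path: mark the lock contended, then sleep in the kernel. */
    if (c != 2)
        c = atomic_exchange(&l->state, 2);
    while (c != 0) {
        /* Sleeps only if state is still 2; otherwise returns at once. */
        futex(&l->state, FUTEX_WAIT, 2);
        c = atomic_exchange(&l->state, 2);
    }
}

static void futex_lock_release(futex_lock *l)
{
    int prev = atomic_exchange(&l->state, 0);

    assert(prev != 0);                      /* releasing a free lock is a bug */
    if (prev == 2)                          /* only then pay for a syscall */
        futex(&l->state, FUTEX_WAKE, 1);
}

/* Demo: several threads bump one counter under the lock. */
static futex_lock demo_lock;
static long demo_counter;

static void *demo_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        futex_lock_acquire(&demo_lock);
        demo_counter++;
        futex_lock_release(&demo_lock);
    }
    return NULL;
}

/* Runs nthreads workers (at most 16) and returns the final counter. */
static long run_demo(int nthreads)
{
    pthread_t t[16];

    assert(nthreads > 0 && nthreads <= 16);
    demo_counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, demo_worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return demo_counter;
}
```

The sketch uses the non-private `FUTEX_WAIT`/`FUTEX_WAKE` operations on purpose: they also work when the lock word lives in memory shared between processes, which is the situation a multi-process server like PostgreSQL would be in. A pure spinlock would replace the slow path with a busy-wait loop, which is the one case with less overhead than this.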

How 3.6 nearly broke PostgreSQL

Posted Oct 3, 2012 4:30 UTC (Wed) by fdr (guest, #57064) [Link]

I think someone wrote up a prototype (actually, I think there have been several over the years; this is the latest incarnation I know of):

http://archives.postgresql.org/pgsql-hackers/2012-06/msg0...

As I understand the post, this uses futexes indirectly, and it's not the most awful thing for Linux.

It's possible that Linux futexes are not a bad idea (though they are also not clearly a huge improvement), but s_lock does okay and has existed for an awfully long time (with iteration), so there's some inertia there.

Also, I think I oversimplified the parent's post: he could have simply meant that something about s_lock is not very good, as opposed to "self-rolled user space spinlocks? That's ridiculous!" (as a straw man). It's possible s_lock could be improved; however, it was fine before and doesn't clearly seem to be at fault right now. I think the suggestion of having to manhandle the platform-specific scheduler also seems excessive. There may be a better solution that serves everyone; somehow I can't see PostgreSQL's spinlocks being one-of-a-kind in this pathology, but I haven't attempted to prove that.

How 3.6 nearly broke PostgreSQL

Posted Oct 3, 2012 18:56 UTC (Wed) by jberkus (subscriber, #55561) [Link]

> My thought when I read about this problem is that PostgreSQL is causing it with their user-space locking and so PostgreSQL needs to fix it.

So there are a couple of major problems with the idea that this issue should be fixed in the PostgreSQL code:

1. PostgreSQL is not the only application with its own spinlock implementation. There is, for example, Oracle.

2. While we might be able to change locking in future versions of PostgreSQL, we can't change locking in past ones.

Even if the next version of PostgreSQL (9.3) has a modified locking implementation which doesn't hit the issues in 3.6, the number of people running older versions of PostgreSQL will far outnumber the folks running the latest version for quite some time. What you'd be saying to all of those folks is, effectively, "don't upgrade Linux".

So any solution which hinges on "make these modifications to PostgreSQL" will instead result in PostgreSQL users deciding to stay on old versions of the kernel. If this problem is equally bad for Oracle, you might even see Red Hat refusing to deploy a version based on 3.6.

There's also the fact that PostgreSQL's locking implementation is complex and quite highly tuned. So the idea that the PostgreSQL project could make changes which wouldn't result in a worse regression in a few months is optimistic. Implementing things like futex support could take literally years.

How 3.6 nearly broke PostgreSQL

Posted Oct 3, 2012 19:00 UTC (Wed) by corbet (editor, #1) [Link]

FWIW, the discussion in kernelland was based on the assumption that this regression was the kernel's problem. Nobody there suggested telling the PostgreSQL developers to come up with a new locking scheme.


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds