LWN: Comments on "PostgreSQL 9.3 beta: Federated databases and more" https://lwn.net/Articles/550418/ This is a special feed containing comments posted to the individual LWN article titled "PostgreSQL 9.3 beta: Federated databases and more". en-us Thu, 04 Sep 2025 08:21:26 +0000 Thu, 04 Sep 2025 08:21:26 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/552558/ https://lwn.net/Articles/552558/ kleptog <div class="FormattedComment"> Threads are theoretically possible but have been held off primarily because it becomes very hard to reason about whether the code as it is is correct. There are a lot of shared structures (with other backends) with locking semantics based on individual processes and it's not immediately clear how that would interact with multiple threads.<br> <p> That said, if you restrict yourself to just the executor you primarily have to deal with the memory allocator and the disk buffers. Is it possible to make that thread-safe? I'm not sure anyone has tried. I think with only a few weeks work you could probably make something functional. However, convincing everyone that the solution is as robust as the current setup is much much harder.<br> </div> Sat, 01 Jun 2013 09:01:47 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/552056/ https://lwn.net/Articles/552056/ rpkelly <div class="FormattedComment"> Is it possible to use threads within a single backend easily? So that things like parallel sorting, joining, or reading could be achieved? Or maybe that doesn't make any sense.<br> </div> Tue, 28 May 2013 20:57:06 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/551218/ https://lwn.net/Articles/551218/ zlynx <div class="FormattedComment"> I love that joke.<br> </div> Mon, 20 May 2013 17:49:40 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/551111/ https://lwn.net/Articles/551111/ ghane <div class="FormattedComment"> May I point out:<br> A programmer had a problem. He thought to himself, "I know, I'll solve it with threads!". has Now problems. two he<br> <p> <a href="https://twitter.com/davidlohr/status/288786300067270656">https://twitter.com/davidlohr/status/288786300067270656</a><br> </div> Sat, 18 May 2013 02:44:23 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/551109/ https://lwn.net/Articles/551109/ luto And <a href="https://code.google.com/p/xxhash/">xxhash</a> claims to beat both. <p> <a href="https://131002.net/siphash/">SipHash</a> is also interesting, but probably not for this use case. Sat, 18 May 2013 01:04:37 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/551005/ https://lwn.net/Articles/551005/ ras <div class="FormattedComment"> Firebird (the perennial other open source SQL database hiding in the attic) supports both threaded and non-threaded modes. On Linux, where a fork() is cheap, nonthreaded wins. On Microsoft the "super server" (threaded) wins. The one notable difference only happens if you are running a beta version that will have the occasional crash. In that case the non-threaded mode is far more reliable on both platforms.<br> <p> Memory doesn't have to be shared by all threads living in the same process. There are any number of ways, including explicitly shared and memory mapped files. These boil down to choosing to "not shared by default" instead of the "shared by" default model threads use. Speed of access to the memory is the same. The latter is safer, on multi machines with multi CPU's usually faster because less sharing means less cache thrashing. But there is an extra cost of creating a process which is why it loses on Windows.<br> </div> Fri, 17 May 2013 10:01:01 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550999/ https://lwn.net/Articles/550999/ Tobu <a href="https://code.google.com/p/smhasher/">MurmurHash3</a> has better <a href="https://code.google.com/p/smhasher/wiki/MurmurHash3#Bulk_speed_test,_hashing_an_8-byte-aligned_256k_block">throughput</a> and <a href="http://blog.aggregateknowledge.com/2012/02/02/choosing-a-good-hash-function-part-3/">dispersion</a>. Fri, 17 May 2013 07:26:37 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550951/ https://lwn.net/Articles/550951/ andresfreund <div class="FormattedComment"> <font class="QuotedText">&gt; Actually, we're still arguing about which algorithm to use. Expect to see some algo changes in successive betas.</font><br> <p> I think that ship has sailed and for the on-disk page checksums we are going with the modified FNV.<br> <p> Explanations about the algorithm:<br> <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/storage/page/checksum.c;hb=HEAD">http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;...</a><br> </div> Thu, 16 May 2013 18:09:18 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550950/ https://lwn.net/Articles/550950/ jberkus <div class="FormattedComment"> There's a number of major disadvantages to threads as well, as anyone who's worked on (or with) MySQL or DB2 can tell you. Possibly chief among them is the inability to isolate resources between backends even when you want to; it's quite common for the unexpected termination of a single connection to shut down the server in threaded databases, and quite rare for that to happen in Postgres. We've debated the threads issue over the years on the mailing lists, and the consensus is that not only would moving to threads stall all other development for at least a year (more likely two), we'd just be exchanging one set of problems for another.<br> </div> Thu, 16 May 2013 17:58:12 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550949/ https://lwn.net/Articles/550949/ jberkus <div class="FormattedComment"> Actually, we're still arguing about which algorithm to use. Expect to see some algo changes in successive betas.<br> </div> Thu, 16 May 2013 17:54:22 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550879/ https://lwn.net/Articles/550879/ heijo <div class="FormattedComment"> Maybe someone should introduce the PostgreSQL "developers" to the newfangled concept of "threads".<br> <p> I hear they can share memory automatically and efficiently and have been available for more than 20 years.<br> <p> </div> Thu, 16 May 2013 12:56:21 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550850/ https://lwn.net/Articles/550850/ ncm <div class="FormattedComment"> So, more like the third paragraph, then. Still, disappointing. But fixable.<br> </div> Thu, 16 May 2013 08:10:45 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550848/ https://lwn.net/Articles/550848/ ptman <div class="FormattedComment"> BTRFS seems to use CRC and ZFS by default uses Fletcher4, although SHA256 is an option. I believe they've explored the options in depth, as I suspect the PostgreSQL dev team also has.<br> </div> Thu, 16 May 2013 08:03:21 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550843/ https://lwn.net/Articles/550843/ andresfreund <div class="FormattedComment"> <font class="QuotedText">&gt; Maybe I'm misunderstanding how PG does journaling these days. "Normally", on a no-overwrite store like PG's, a journal entry just says "block N containing metadata is now canon", which metadata is known to be on disk already, and identifies new data blocks also known to be on disk already, and other blocks that are now free. In this scenario, data and metadata blocks may be written out eagerly knowing they will all be ignored until the (tiny) journal entry that blesses the new metadata hits the disk.</font><br> <p> <font class="QuotedText">&gt; You seem to be describing a process more like a traditional store and write-ahead log, where first you write in the log all the changes are planned for the main store, and then lazily update the main store, writing it all again, knowing that if you are interrupted somebody else can replay the rest of the log.</font><br> <p> Postgres' implementation is a pretty classical write ahead log scheme that is far more like the second scheme you describe than the first one. And afaik has been since the introduction of crash safety (in 7.0 or so).<br> <p> <font class="QuotedText">&gt; But I thought the great advantage of the PG scheme is that you only have to write once.</font><br> <p> Hm. Not sure what that corresponds to then? Postgres' WAL doesn't write full pages (except in some circumstances, but let's leave them out for now), but only a description of the change like 'insert tuple at slot X of page YYY) so amount of data that has to be fsync()ed for commit is reasonably small. Perhaps that is what you were referring to?<br> </div> Thu, 16 May 2013 07:17:58 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550831/ https://lwn.net/Articles/550831/ ncm <div class="FormattedComment"> Maybe I'm misunderstanding how PG does journaling these days. "Normally", on a no-overwrite store like PG's, a journal entry just says "block N containing metadata is now canon", which metadata is known to be on disk already, and identifies new data blocks also known to be on disk already, and other blocks that are now free. In this scenario, data and metadata blocks may be written out eagerly knowing they will all be ignored until the (tiny) journal entry that blesses the new metadata hits the disk.<br> <p> You seem to be describing a process more like a traditional store and write-ahead log, where first you write in the log all the changes are planned for the main store, and then lazily update the main store, writing it all again, knowing that if you are interrupted somebody else can replay the rest of the log. But I thought the great advantage of the PG scheme is that you only have to write once.<br> <p> Maybe only metadata goes to the journal and is then copied out, while bulk data goes directly into unused blocks?<br> </div> Thu, 16 May 2013 05:19:41 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550809/ https://lwn.net/Articles/550809/ andresfreund <div class="FormattedComment"> <font class="QuotedText">&gt; Can you write() directly from mmaped pages?</font><br> Afair there are no checks made against it, so yes. But what would be the point? You need to modify the page first, which makes the write superflous? It doesn't prevent the kernel from writing out the page too early either.<br> <p> I think I am not following where you are going with this?<br> </div> Thu, 16 May 2013 00:14:50 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550808/ https://lwn.net/Articles/550808/ Cyberax <div class="FormattedComment"> Can you write() directly from mmaped pages?<br> </div> Thu, 16 May 2013 00:08:01 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550799/ https://lwn.net/Articles/550799/ andresfreund <div class="FormattedComment"> <font class="QuotedText">&gt; Or use mmap() for everything except the journal?</font><br> The point is that writeout for file writes needs to have interlock with the writes for the journal. You can only writeout a modified page if its corresponding log entry has already been written out.<br> <p> Writing out only the journal in an mmap()ed fashion would actually be far easier. But I don't see much benefit in that direction since only small amounts of data (up to maybe 64MB or so has been measured as being benefical) are held in memory for the log. And we frequently write to new files which would always requiring an mmap()/munmap() cycle (which actually sucks for concurrency).<br> </div> Wed, 15 May 2013 23:17:32 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550786/ https://lwn.net/Articles/550786/ andresfreund <div class="FormattedComment"> <font class="QuotedText">&gt; Can you combine both approaches? I.e. use write() to write actual data to disk but use mmap() for reads?</font><br> I don't see how it could be done without destroying either the benefits (fixin memory waste by caching a buffer in pg and in the os) or harming other things. The PG code relies on quickly marking a buffer dirty, requiring to copy it somewhere else for that would be rather expensive.<br> <p> Calling munmap()/mmap() everytimes that happens would also be prohibitively expensive, especially in concurrent situations, so we cannot just do it for the individual memory areas.<br> <p> But that doesn't mean there isn't a way. I just don't know of anyone describing a realistic implementation strategy so far.<br> </div> Wed, 15 May 2013 21:43:53 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550783/ https://lwn.net/Articles/550783/ ncm <div class="FormattedComment"> Or use mmap() for everything except the journal?<br> </div> Wed, 15 May 2013 21:26:13 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550778/ https://lwn.net/Articles/550778/ Cyberax <div class="FormattedComment"> Can you combine both approaches? I.e. use write() to write actual data to disk but use mmap() for reads?<br> </div> Wed, 15 May 2013 20:58:47 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550761/ https://lwn.net/Articles/550761/ andresfreund <div class="FormattedComment"> <font class="QuotedText">&gt; The big issue with mmap() for fileIO, as I understand it, is that we can never be sure when the file has been flushed to disk. For a guaranteed-durability database like PostgreSQL, that's not something we can live with.</font><br> The problem is that we cannot influence the order in which the pages are flushed to disk. For crash safety we cannot allow any pages to be written out that have a LSN (Log Sequence Number := Address of the write ahead log record covering the last modification) bigger than the last LSN of the corresponding WAL that has been flushed out.<br> So we would have to be able to reliably prevent writeout on a page (postgres' ones, by default 8kb) granularity in an efficient manner.<br> <p> <p> </div> Wed, 15 May 2013 19:38:11 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550756/ https://lwn.net/Articles/550756/ nix <div class="FormattedComment"> You can be sure of that with msync(). The problem with using mmap() for bulk file I/O is that it is very hard for the kernel to predict access patterns, so prefetching and the like don't work as well as they might without special madvise()ment from userspace, which next to nobody gives. Meanwhile, for read() and write() it just works.<br> <p> Also, mmap() implies page faults, which imply TLB shootdowns, which are slower than straight reads into already-allocated buffers as is generally done by read(). Combine that with the fact that EOF is fairly hard to detect, and appending is harder, and...<br> <p> I wish mmap() was used for everything: it's a lovely unifying interface. But it's also a bit of a pig.<br> </div> Wed, 15 May 2013 19:22:11 +0000 mmapped table files https://lwn.net/Articles/550747/ https://lwn.net/Articles/550747/ ncm <div class="FormattedComment"> Perhaps I am mistaken, but I believe that if PG were to keep the mmapped file's file descriptor open, fsync() would still work on the mmapped pages. The advantages of mmap(), here, would include that PG can avoid copy operations and share pages with the kernel's buffer cache, and the kernel can begin writing out dirtied file blocks early, in parallel with PG operations. PostgreSQL, with its never-overwrite storage model, seems uniquely positioned to take advantage of these properties.<br> </div> Wed, 15 May 2013 19:03:31 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550741/ https://lwn.net/Articles/550741/ jberkus <div class="FormattedComment"> The big issue with mmap() for fileIO, as I understand it, is that we can never be sure when the file has been flushed to disk. For a guaranteed-durability database like PostgreSQL, that's not something we can live with.<br> </div> Wed, 15 May 2013 18:30:29 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550737/ https://lwn.net/Articles/550737/ dlang <div class="FormattedComment"> This is not using mmap() for file IO, it's using it for communications between the different postgres processes.<br> </div> Wed, 15 May 2013 18:00:08 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550649/ https://lwn.net/Articles/550649/ ortalo <div class="FormattedComment"> Berkeley is not the only funny part (esp. if you have been invited to Ingres Corp. marketing lunches a few decades ago).<br> <p> <a href="http://www.postgresql.org/about/history/">http://www.postgresql.org/about/history/</a><br> <p> </div> Wed, 15 May 2013 07:43:34 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550646/ https://lwn.net/Articles/550646/ ncm <div class="FormattedComment"> I recall serious discussions, circa 2000, of using mmap() to get faster file i/o on modern systems. The idea was rejected because mmap() was not considered sufficiently universal, or uniformly faster than read/write(), on the myriad PG targets. Maybe the decision can now be revisited.<br> <p> About the same time, PG got a 64-bit block CRC using a polynomial extracted from a magnetic-tape format standard. I gather that modern cryptographic hashes can be computed faster, on modern hardware, than CRCs. Maybe it's time to reconsider that choice too?<br> <p> It's gratifying to look back on decades of monotonic improvement along so many axes and recognize the mature leadership that has made it possible. It could so easily have gone off the rails at every point.<br> </div> Wed, 15 May 2013 07:03:23 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550642/ https://lwn.net/Articles/550642/ felixfix <div class="FormattedComment"> Well ... not to hijack Jon's website too much ... I went to school there for a while and discovered I was an atrocious student ... but I do remember Ronnie Raygun's helicopter pepper spraying the crowds, and learned that my radical politics evaporated pretty quickly when I came across crowds breaking out firehoses and breaking windows ... learned a few shortcuts ... and have memories of the sort that are fonder from a distance :-)<br> </div> Wed, 15 May 2013 04:34:45 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550639/ https://lwn.net/Articles/550639/ jberkus <div class="FormattedComment"> Hah!<br> <p> Given my once-radical leftist politics, I was known as "Joshua Bezerkeley" in college.<br> <p> But no, Berkus is a Ukrainian name, actually.<br> </div> Wed, 15 May 2013 03:08:57 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550611/ https://lwn.net/Articles/550611/ Cyberax <div class="FormattedComment"> A little nominative determinism ( <a rel="nofollow" href="http://www.scilogs.com/counterbalanced/academic-nominative-determinism-the-years-best/">http://www.scilogs.com/counterbalanced/academic-nominativ...</a> ) can't hurt :)<br> </div> Tue, 14 May 2013 23:14:16 +0000 PostgreSQL 9.3 beta: Federated databases and more https://lwn.net/Articles/550602/ https://lwn.net/Articles/550602/ felixfix <div class="FormattedComment"> Well, I battled with myself, and the evil pun side won.<br> <p> I did not know of the connection between UC Berkeley and PostgreSQL.<br> <p> Is your name a pseudonym, possibly a pun? Or did you gravitate towards PostgreSQL because of your name?<br> <p> :-)<br> </div> Tue, 14 May 2013 22:17:57 +0000