PostgreSQL 9.3 beta: Federated databases and more
Posted May 15, 2013 7:03 UTC (Wed)
by ncm (guest, #165)
[Link]
About the same time, PG got a 64-bit block CRC using a polynomial extracted from a magnetic-tape format standard. I gather that modern cryptographic hashes can be computed faster, on modern hardware, than CRCs. Maybe it's time to reconsider that choice too?
It's gratifying to look back on decades of monotonic improvement along so many axes and recognize the mature leadership that has made it possible. It could so easily have gone off the rails at every point.
Posted May 15, 2013 18:00 UTC (Wed)
by dlang (guest, #313)
[Link] (20 responses)
Posted May 15, 2013 18:30 UTC (Wed)
by jberkus (guest, #55561)
[Link] (12 responses)
Posted May 15, 2013 19:03 UTC (Wed)
by ncm (guest, #165)
[Link]
Posted May 15, 2013 19:22 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Also, mmap() implies page faults, which imply TLB shootdowns, which are slower than straight reads into already-allocated buffers as is generally done by read(). Combine that with the fact that EOF is fairly hard to detect, and appending is harder, and...
I wish mmap() were used for everything: it's a lovely unifying interface. But it's also a bit of a pig.
Posted May 15, 2013 19:38 UTC (Wed)
by andresfreund (subscriber, #69562)
[Link] (9 responses)
Posted May 15, 2013 20:58 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (8 responses)
Posted May 15, 2013 21:26 UTC (Wed)
by ncm (guest, #165)
[Link] (4 responses)
Posted May 15, 2013 23:17 UTC (Wed)
by andresfreund (subscriber, #69562)
[Link] (3 responses)
Writing out only the journal in an mmap()ed fashion would actually be far easier. But I don't see much benefit in that direction, since only small amounts of data (up to maybe 64MB or so has been measured as beneficial) are held in memory for the log. And we frequently write to new files, which would always require an mmap()/munmap() cycle (which actually sucks for concurrency).
Posted May 16, 2013 5:19 UTC (Thu)
by ncm (guest, #165)
[Link] (2 responses)
You seem to be describing a process more like a traditional store and write-ahead log, where first you write into the log all the changes that are planned for the main store, and then lazily update the main store, writing it all again, knowing that if you are interrupted somebody else can replay the rest of the log. But I thought the great advantage of the PG scheme is that you only have to write once.
Maybe only metadata goes to the journal and is then copied out, while bulk data goes directly into unused blocks?
Posted May 16, 2013 7:17 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link] (1 responses)
> You seem to be describing a process more like a traditional store and write-ahead log, where first you write into the log all the changes that are planned for the main store, and then lazily update the main store, writing it all again, knowing that if you are interrupted somebody else can replay the rest of the log.
Postgres' implementation is a pretty classical write-ahead log scheme that is far more like the second scheme you describe than the first one. And AFAIK it has been since the introduction of crash safety (in 7.0 or so).
> But I thought the great advantage of the PG scheme is that you only have to write once.
Hm. Not sure what that corresponds to, then? Postgres' WAL doesn't write full pages (except in some circumstances, but let's leave those out for now), only a description of the change, like 'insert tuple at slot X of page YYY', so the amount of data that has to be fsync()ed for a commit is reasonably small. Perhaps that is what you were referring to?
Posted May 16, 2013 8:10 UTC (Thu)
by ncm (guest, #165)
[Link]
Posted May 15, 2013 21:43 UTC (Wed)
by andresfreund (subscriber, #69562)
[Link] (2 responses)
Calling munmap()/mmap() every time that happens would also be prohibitively expensive, especially in concurrent situations, so we cannot just do it for the individual memory areas.
But that doesn't mean there isn't a way. I just don't know of anyone describing a realistic implementation strategy so far.
Posted May 16, 2013 0:08 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted May 16, 2013 0:14 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link]
I think I am not following where you are going with this?
Posted May 16, 2013 12:56 UTC (Thu)
by heijo (guest, #88363)
[Link] (6 responses)
I hear they can share memory automatically and efficiently and have been available for more than 20 years.
Posted May 16, 2013 17:58 UTC (Thu)
by jberkus (guest, #55561)
[Link] (4 responses)
Posted May 18, 2013 2:44 UTC (Sat)
by ghane (guest, #1805)
[Link] (1 responses)
Posted May 20, 2013 17:49 UTC (Mon)
by zlynx (guest, #2285)
[Link]
Posted May 28, 2013 20:57 UTC (Tue)
by rpkelly (guest, #74224)
[Link] (1 responses)
Posted Jun 1, 2013 9:01 UTC (Sat)
by kleptog (subscriber, #1183)
[Link]
That said, if you restrict yourself to just the executor, you primarily have to deal with the memory allocator and the disk buffers. Is it possible to make those thread-safe? I'm not sure anyone has tried. I think with only a few weeks' work you could probably make something functional. However, convincing everyone that the solution is as robust as the current setup is much, much harder.
Posted May 17, 2013 10:01 UTC (Fri)
by ras (subscriber, #33059)
[Link]
Memory doesn't have to be shared by all threads living in the same process. There are any number of ways to share it selectively, including explicitly shared memory and memory-mapped files. These boil down to choosing a "not shared by default" model instead of the "shared by default" model threads use. Speed of access to the memory is the same. The former is safer, and on machines with multiple CPUs usually faster, because less sharing means less cache thrashing. But there is an extra cost to creating a process, which is why it loses on Windows.
Posted May 16, 2013 8:03 UTC (Thu)
by ptman (subscriber, #57271)
[Link] (4 responses)
Posted May 16, 2013 17:54 UTC (Thu)
by jberkus (guest, #55561)
[Link] (3 responses)
Posted May 16, 2013 18:09 UTC (Thu)
by andresfreund (subscriber, #69562)
[Link] (2 responses)
I think that ship has sailed and for the on-disk page checksums we are going with the modified FNV.
Explanations about the algorithm:
Posted May 17, 2013 7:26 UTC (Fri)
by Tobu (subscriber, #24111)
[Link] (1 responses)
mmapped table files
The problem is that we cannot influence the order in which the pages are flushed to disk. For crash safety we cannot allow any page to be written out whose LSN (Log Sequence Number := address of the write-ahead log record covering the last modification) is bigger than the LSN up to which the corresponding WAL has been flushed out.
So we would have to be able to reliably and efficiently prevent writeout at the granularity of a single page (Postgres' pages, by default 8 kB).
The point is that writeout of the data files needs to interlock with writes to the journal: you can only write out a modified page if its corresponding log entry has already been written out.
I don't see how it could be done without destroying the benefit (fixing the memory waste of caching a buffer both in PG and in the OS) or harming other things. The PG code relies on marking a buffer dirty quickly; requiring the buffer to be copied somewhere else for that would be rather expensive.
AFAIR there are no checks made against it, so yes. But what would be the point? You need to modify the page first, which makes the write superfluous. It doesn't prevent the kernel from writing out the page too early, either.
A programmer had a problem. He thought to himself, "I know, I'll solve it with threads!". has Now problems. two he
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;...
MurmurHash3 has better throughput and dispersion.