PostgreSQL 9.3 beta: Federated databases and more

Posted May 15, 2013 19:38 UTC (Wed) by andresfreund (subscriber, #69562)
In reply to: PostgreSQL 9.3 beta: Federated databases and more by jberkus
Parent article: PostgreSQL 9.3 beta: Federated databases and more

> The big issue with mmap() for fileIO, as I understand it, is that we can never be sure when the file has been flushed to disk. For a guaranteed-durability database like PostgreSQL, that's not something we can live with.
The problem is that we cannot influence the order in which the pages are flushed to disk. For crash safety we cannot allow any page to be written out whose LSN (Log Sequence Number := address of the write-ahead log record covering the last modification) is bigger than the LSN up to which the WAL has already been flushed to disk.
So we would have to be able to reliably and efficiently prevent writeout at page granularity (postgres' pages, 8kB by default).
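
To make the ordering rule above concrete, here is a minimal sketch of a buffer manager that enforces "WAL first, then the data page". It is a toy model and not PostgreSQL's actual code; `ToyWal`, `ToyPage`, `modify` and `write_out` are hypothetical names, and the LSN is assumed to be a plain byte offset into an in-memory log.

```python
class ToyWal:
    """Hypothetical in-memory WAL; an LSN is just a byte offset here."""

    def __init__(self):
        self.insert_lsn = 0   # end of the last record appended
        self.flushed_lsn = 0  # everything up to this LSN counts as durable

    def append(self, record: bytes) -> int:
        # Returns the LSN at the end of the new record.
        self.insert_lsn += len(record)
        return self.insert_lsn

    def flush(self, upto: int) -> None:
        # A real system would write() and fsync() the log here.
        self.flushed_lsn = max(self.flushed_lsn, min(upto, self.insert_lsn))


class ToyPage:
    def __init__(self):
        self.lsn = 0        # LSN of the last WAL record that touched the page
        self.dirty = False


def modify(page: ToyPage, wal: ToyWal, record: bytes) -> None:
    # Log the change first, then stamp the page with the record's LSN.
    page.lsn = wal.append(record)
    page.dirty = True


def write_out(page: ToyPage, wal: ToyWal, events: list) -> None:
    # The interlock: never write a page whose LSN is ahead of the flushed WAL.
    if page.lsn > wal.flushed_lsn:
        wal.flush(page.lsn)
        events.append(("wal_flush", page.lsn))
    events.append(("page_write", page.lsn))
    page.dirty = False


wal, page, events = ToyWal(), ToyPage(), []
modify(page, wal, b"insert tuple at slot 3")
write_out(page, wal, events)
```

With write(), the database decides when `write_out` runs and can do the check. With a shared writable mmap(), the kernel may write the dirty page back at any moment without passing through any such check, which is the problem described above.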

Posted May 15, 2013 20:58 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

Can you combine both approaches? I.e. use write() to write actual data to disk but use mmap() for reads?

Posted May 15, 2013 21:26 UTC (Wed) by ncm (guest, #165)

Or use mmap() for everything except the journal?

Posted May 15, 2013 23:17 UTC (Wed) by andresfreund (subscriber, #69562)

> Or use mmap() for everything except the journal?
The point is that writeout of the data files needs to be interlocked with the writes to the journal. You can only write out a modified page if its corresponding log entry has already been written out.

Writing out only the journal in an mmap()ed fashion would actually be far easier. But I don't see much benefit in that direction, since only small amounts of data (up to maybe 64MB or so has been measured as being beneficial) are held in memory for the log. And we frequently write to new files, which would always require an mmap()/munmap() cycle (which actually sucks for concurrency).

Posted May 16, 2013 5:19 UTC (Thu) by ncm (guest, #165)

Maybe I'm misunderstanding how PG does journaling these days. "Normally", on a no-overwrite store like PG's, a journal entry just says "block N containing metadata is now canon", which metadata is known to be on disk already, and identifies new data blocks also known to be on disk already, and other blocks that are now free. In this scenario, data and metadata blocks may be written out eagerly knowing they will all be ignored until the (tiny) journal entry that blesses the new metadata hits the disk.

You seem to be describing a process more like a traditional store and write-ahead log, where first you write in the log all the changes that are planned for the main store, and then lazily update the main store, writing it all again, knowing that if you are interrupted somebody else can replay the rest of the log. But I thought the great advantage of the PG scheme is that you only have to write once.

Maybe only metadata goes to the journal and is then copied out, while bulk data goes directly into unused blocks?

Posted May 16, 2013 7:17 UTC (Thu) by andresfreund (subscriber, #69562)

> Maybe I'm misunderstanding how PG does journaling these days. "Normally", on a no-overwrite store like PG's, a journal entry just says "block N containing metadata is now canon", which metadata is known to be on disk already, and identifies new data blocks also known to be on disk already, and other blocks that are now free. In this scenario, data and metadata blocks may be written out eagerly knowing they will all be ignored until the (tiny) journal entry that blesses the new metadata hits the disk.

> You seem to be describing a process more like a traditional store and write-ahead log, where first you write in the log all the changes that are planned for the main store, and then lazily update the main store, writing it all again, knowing that if you are interrupted somebody else can replay the rest of the log.

Postgres' implementation is a pretty classical write-ahead log scheme that is far more like the second scheme you describe than the first one. And afaik it has been since the introduction of crash safety (in 7.1 or so).

> But I thought the great advantage of the PG scheme is that you only have to write once.

Hm. Not sure what that corresponds to then? Postgres' WAL doesn't write full pages (except in some circumstances, but let's leave them out for now), but only a description of the change, like 'insert tuple at slot X of page YYY', so the amount of data that has to be fsync()ed for commit is reasonably small. Perhaps that is what you were referring to?
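
As a rough illustration of why such a record is small compared to a full page, here is a sketch of an 'insert tuple' record. The layout is made up for this example (page number, slot, tuple length, tuple bytes) and is not PostgreSQL's real WAL format; `encode_insert` and `apply_insert` are hypothetical helpers.

```python
import struct

PAGE_SIZE = 8192  # PostgreSQL's default page size

# Assumed record header: page number (uint32), slot (uint16), tuple length (uint16).
HEADER = struct.Struct("<IHH")


def encode_insert(page_no: int, slot: int, tuple_bytes: bytes) -> bytes:
    """Describe the change instead of logging the whole page."""
    return HEADER.pack(page_no, slot, len(tuple_bytes)) + tuple_bytes


def apply_insert(pages: dict, record: bytes) -> None:
    """Replay the record against a toy page store (dict of page -> slots)."""
    page_no, slot, length = HEADER.unpack_from(record)
    data = record[HEADER.size:HEADER.size + length]
    pages.setdefault(page_no, {})[slot] = data


record = encode_insert(42, 7, b"hello, world")
pages = {}
apply_insert(pages, record)  # what crash recovery would do with the log
```

Only `record` has to reach disk at commit time; the modified page itself can be written later, as long as the log is flushed first.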

Posted May 16, 2013 8:10 UTC (Thu) by ncm (guest, #165)

So, more like the third paragraph, then. Still, disappointing. But fixable.

Posted May 15, 2013 21:43 UTC (Wed) by andresfreund (subscriber, #69562)

> Can you combine both approaches? I.e. use write() to write actual data to disk but use mmap() for reads?
I don't see how it could be done without either destroying the benefits (fixing the memory waste of caching a buffer both in pg and in the OS) or harming other things. The PG code relies on quickly marking a buffer dirty; having to copy it somewhere else for that would be rather expensive.

Calling munmap()/mmap() every time that happens would also be prohibitively expensive, especially in concurrent situations, so we cannot just do it for the individual memory areas.

But that doesn't mean there isn't a way. I just don't know of anyone describing a realistic implementation strategy so far.

Posted May 16, 2013 0:08 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

Can you write() directly from mmaped pages?

Posted May 16, 2013 0:14 UTC (Thu) by andresfreund (subscriber, #69562)

> Can you write() directly from mmaped pages?
Afair there are no checks made against it, so yes. But what would be the point? You need to modify the page first, which makes the write superfluous. It doesn't prevent the kernel from writing out the page too early either.
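
For what it's worth, a quick way to check the "no checks against it" part is a small script like the following. It is only a sketch: the file names and the 8kB size are arbitrary choices for the example, and it writes to a second file purely to make the result easy to inspect.

```python
import mmap
import os
import tempfile

PAGE_SIZE = 8192

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "src")
    dst_path = os.path.join(tmp, "dst")

    # Create a one-page file to map.
    with open(src_path, "wb") as f:
        f.write(b"\0" * PAGE_SIZE)

    with open(src_path, "r+b") as src, open(dst_path, "wb") as dst:
        with mmap.mmap(src.fileno(), PAGE_SIZE) as mm:
            mm[0:5] = b"hello"                    # modify the mapped page in place
            written = os.write(dst.fileno(), mm)  # write() straight from the mapping

    with open(dst_path, "rb") as f:
        copied = f.read()
```

Note that the assignment to `mm[0:5]` has already dirtied the page in the kernel's page cache for `src`, so the kernel is free to write it back whenever it likes, with or without the explicit write().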

I think I am not following where you are going with this?