Also, consider DRBD

Posted Jun 1, 2012 13:30 UTC (Fri) by Richard_J_Neill (subscriber, #23093)
Parent article: Clustering, development, and galactic conquest at PGCon 2012

One other way to do database replication (between just two servers, where one is a hot standby) is to use DRBD at the block-device layer, underneath the filesystem. This actually works quite well, and performance is good too.

We have two adjacent servers, each of which has paired RAID 1 (mirrored) SSDs for /var/lib/pgsql/data. The servers are connected by a dedicated gigabit ethernet cable, and run DRBD with protocol B. PostgreSQL runs only on the primary machine. When the application is informed of a successful commit, the data is on the primary SSD, and at least in RAM on the secondary server.
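As a rough illustration of what "protocol B" promises at commit time, here is a minimal Python sketch of when a write is reported complete under DRBD's three replication protocols. The `WriteState` class and `write_acknowledged` function are hypothetical names invented for this example; only the A/B/C semantics come from DRBD's documentation.

```python
from dataclasses import dataclass


@dataclass
class WriteState:
    """Where a single replicated write has got to (toy model)."""
    local_disk: bool        # written to the primary's disk
    in_send_buffer: bool    # handed to the primary's TCP send buffer
    peer_ram: bool          # received by the secondary (in its memory)
    peer_disk: bool         # written to the secondary's disk


def write_acknowledged(protocol: str, s: WriteState) -> bool:
    """Return True if the write would be reported complete to the caller."""
    if protocol == "A":     # asynchronous
        return s.local_disk and s.in_send_buffer
    if protocol == "B":     # memory-synchronous
        return s.local_disk and s.peer_ram
    if protocol == "C":     # fully synchronous
        return s.local_disk and s.peer_disk
    raise ValueError(f"unknown protocol: {protocol!r}")


# The situation described above: on the primary SSD, and in the
# secondary's RAM, but not necessarily on the secondary's disk yet.
state = WriteState(local_disk=True, in_send_buffer=True,
                   peer_ram=True, peer_disk=False)
for proto in "ABC":
    print(proto, write_acknowledged(proto, state))
```

With protocol B, a committed transaction is lost only if the primary's storage is destroyed and the secondary loses power before flushing its RAM.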

It's possible to get completely seamless failover here, using e.g. heartbeat, though we prefer to require manual intervention because, for us, data loss is bad; 5 minutes' downtime is acceptable in an emergency; "split-brain" would be a real problem.


Also, consider DRBD

Posted Jun 1, 2012 16:28 UTC (Fri) by jberkus (subscriber, #55561) [Link]

Richard,

A fair number of people do this, and it certainly works as far as redundancy and failover are concerned. However, since DRBD replicates the whole filesystem, it has to copy a LOT more data than database replication does, and as a result performance is characteristically much worse, especially with less than ideal disk or network speeds. The Postgres+DRBD systems I've consulted on had response times on small writes which were at least 3X longer than a standalone system.

There's a commercial company (I've forgotten the name) which has enhancements to DRBD including a robust asynchronous mode which improves on basic DRBD performance.

Also, consider DRBD

Posted Jun 1, 2012 16:44 UTC (Fri) by Richard_J_Neill (subscriber, #23093) [Link]

We had to deploy this into production ~ 3 years ago, when there were fewer alternatives. I agree that it hurt performance a bit, though in fact we still got pretty awesome performance: the secondary server was completely idle apart from the drbd slave, and the network was a single 6' cable between 2 dedicated gigabit cards. With protocol B, I think we found that DRBD's copying data to the secondary server's RAM took less time than for the primary server to write to its disk (at least for larger datasets).
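A back-of-the-envelope way to see why protocol B can add little latency: if the local disk write and the transfer to the peer proceed in parallel, the commit waits for the slower of the two. The sketch below is a simplified model under that assumption, not a measurement; the function name and all timings are made up for illustration.

```python
def commit_latency_ms(protocol: str, local_disk_ms: float,
                      network_ms: float, peer_disk_ms: float) -> float:
    """Toy latency model: local write and replication run in parallel.

    network_ms is the assumed time to ship the data to the peer and get
    an acknowledgement back; all inputs are illustrative assumptions.
    """
    if protocol == "A":
        return local_disk_ms                      # no wait for the peer
    if protocol == "B":
        return max(local_disk_ms, network_ms)     # wait for peer RAM
    if protocol == "C":
        return max(local_disk_ms, network_ms + peer_disk_ms)
    raise ValueError(f"unknown protocol: {protocol!r}")


# Hypothetical numbers: a 2 ms local write, 0.25 ms over the dedicated link.
for proto in "ABC":
    print(proto, commit_latency_ms(proto, 2.0, 0.25, 2.0))
```

In this model, protocol B costs nothing extra whenever the network leg is faster than the local disk write, which matches the observation above for larger writes. For small writes the network round trip is a larger share of the total.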

I'm curious as to why replicating the whole filesystem has to copy much more data than other forms of database replication: I thought that typically the ext4 overhead was quite small?

As you say, for small writes, there are some problems, and there is some case for deciding on a per-transaction-type basis whether that transaction is critical, or slightly less critical. (i.e. if someone puts an axe through the server right now, how much do the last 20 ms of that type of data matter?). Either way, postgresql is an amazing product :-)

Also, consider DRBD

Posted Jun 1, 2012 17:52 UTC (Fri) by nix (subscriber, #2304) [Link]

When you copy the whole filesystem, you're copying across changes to the WAL (journal) as well as changes to the data/ files it is journalling. As the WAL is frequently (f)(data)sync()ed, this overhead is even higher. I am also sceptical as to its safety: it seems easy to me to synch across a change to the WAL on commit or WAL rollover but lose power before the corresponding change to the datafile is synched: on restart, the remote PostgreSQL will think that no WAL replay is necessary, when in fact one is needed.

PostgreSQL's native replication simply streams the WAL across, and replays it into the datafiles on the remote node. This is guaranteed safe.
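To put a rough number on the difference in bytes shipped, here is a small sketch. It assumes PostgreSQL's default 8 kB page size and deliberately ignores filesystem journal traffic, full-page images in the WAL, and repeated rewrites of partially filled WAL blocks, so it is an illustration rather than a measurement. The function names and the workload figures are hypothetical.

```python
PAGE_BYTES = 8192  # PostgreSQL's default block size


def block_replication_bytes(wal_bytes: int, dirty_data_pages: int) -> int:
    """Bytes a block-level replicator ships: WAL writes plus data pages."""
    return wal_bytes + dirty_data_pages * PAGE_BYTES


def wal_streaming_bytes(wal_bytes: int) -> int:
    """Bytes WAL streaming ships: only the WAL itself."""
    return wal_bytes


# Hypothetical workload: 1000 small transactions of ~200 WAL bytes each,
# every one dirtying a different data page.
wal = 1000 * 200
block = block_replication_bytes(wal, dirty_data_pages=1000)
stream = wal_streaming_bytes(wal)
print(block, stream, round(block / stream, 1))
```

The ratio shrinks considerably when many transactions touch the same pages between checkpoints, since each dirty page is then flushed only once.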

Also, consider DRBD

Posted Jun 1, 2012 22:33 UTC (Fri) by jberkus (subscriber, #55561) [Link]

Nix,

We've had reason to destruction-test DRBD+Postgres. From a data safety perspective it works as well as one could hope, provided that you take steps to prevent split-brain. It's just performance which leaves something to be desired.

Also, consider DRBD

Posted Jun 4, 2012 12:36 UTC (Mon) by nix (subscriber, #2304) [Link]

Oh, I'm willing to believe it works most of the time. I just can't see how it's safe 100% of the time (when WAL rollover happens). However, it might be that the failure cases require long periods of downtime and very unluckily-timed powerdowns, in which case you might never see it.

But quite possibly I'm missing something.

Also, consider DRBD

Posted Jun 4, 2012 15:07 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

As long as the snapshot is atomic it *has* to work. Otherwise the original purpose of the WAL, crash recovery, wouldn't be met.
Checkpoints are crash safe. What's the problem you're seeing there?
The checkpoint record is only written to the WAL *after* everything but the checkpoint information has been written out. Only after the checkpoint has been fsynced to disk are resources, like the WAL, reused.
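The ordering argument can be checked with a toy model. The sketch below is a deliberately simplified stand-in for PostgreSQL's real WAL and checkpoint machinery (all names are invented for the example): every change is logged before its page is written, and a checkpoint record is appended only after the dirty pages are flushed. It then "crashes" after every possible prefix of the durable writes and checks that recovery reproduces all logged changes.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, tuple]  # ("wal", record) or ("data", (key, value))


def run_workload() -> List[Event]:
    """Return the ordered durable writes of a toy database."""
    events: List[Event] = []
    dirty: Dict[str, int] = {}
    wal_len = 0

    def set_value(key: str, value: int) -> None:
        nonlocal wal_len
        events.append(("wal", ("set", key, value)))  # log first
        wal_len += 1
        dirty[key] = value                           # page changes in RAM only

    def checkpoint() -> None:
        nonlocal wal_len
        redo = wal_len                               # replay start for this checkpoint
        for key, value in sorted(dirty.items()):
            events.append(("data", (key, value)))    # flush dirty pages
        dirty.clear()
        events.append(("wal", ("ckpt", redo)))       # checkpoint record goes last
        wal_len += 1

    set_value("a", 1)
    set_value("b", 2)
    checkpoint()
    set_value("a", 3)
    set_value("c", 4)
    checkpoint()
    set_value("b", 5)
    return events


def crash_and_recover(events: List[Event], n: int) -> Dict[str, int]:
    """Keep only the first n durable writes, then run crash recovery."""
    wal: List[tuple] = []
    data: Dict[str, int] = {}
    for kind, payload in events[:n]:
        if kind == "wal":
            wal.append(payload)
        else:
            key, value = payload
            data[key] = value
    redo = 0
    for record in wal:                               # find the last checkpoint
        if record[0] == "ckpt":
            redo = record[1]
    for record in wal[redo:]:                        # replay from its redo point
        if record[0] == "set":
            _, key, value = record
            data[key] = value
    return data


def expected_state(events: List[Event], n: int) -> Dict[str, int]:
    """State implied by every logged change among the first n writes."""
    state: Dict[str, int] = {}
    for kind, payload in events[:n]:
        if kind == "wal" and payload[0] == "set":
            state[payload[1]] = payload[2]
    return state


history = run_workload()
print(all(crash_and_recover(history, n) == expected_state(history, n)
          for n in range(len(history) + 1)))
```

The check passes for every prefix because the replica only ever sees the writes in the order the primary issued them. That in-order delivery is exactly what an atomic snapshot, or a block replicator that preserves write ordering, provides.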

Also, consider DRBD

Posted Jun 4, 2012 18:18 UTC (Mon) by nix (subscriber, #2304) [Link]

Agreed, this is perfectly fine. I now suspect that my memory is lying to me: it tells me faintly that DRBD may transmit data in arbitrary order and does not do a complete transmit on fsync(), but I now suspect I'm thinking of some other distributed block device and just mixed it up with DRBD. If DRBD respects fsync(), then everything works.
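For contrast, here is a minimal sketch of the failure a reordering block device could cause, using a toy recovery routine with hypothetical names (this is not PostgreSQL's actual recovery code). If the checkpoint record reaches the replica but the page flush that preceded it does not, recovery skips the earlier WAL record and the change is lost.

```python
from typing import Dict, List


def recover(wal: List[tuple], data: Dict[str, int]) -> Dict[str, int]:
    """Toy crash recovery: replay WAL from the last checkpoint's redo index."""
    redo = 0
    for record in wal:
        if record[0] == "ckpt":
            redo = record[1]
    state = dict(data)
    for record in wal[redo:]:
        if record[0] == "set":
            _, key, value = record
            state[key] = value
    return state


# The primary logged a=1, flushed the page, then wrote a checkpoint
# record saying "replay from WAL index 1".
wal = [("set", "a", 1), ("ckpt", 1)]

in_order = recover(wal, {"a": 1})   # page flush arrived before the record
reordered = recover(wal, {})        # record arrived, page flush did not

print(in_order)
print(reordered)
```

In the reordered case the key `a` is missing after recovery, which is why the replication layer has to honour flush ordering for the scheme to be safe.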

Also, consider DRBD

Posted Jun 4, 2012 18:25 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

I think you can configure it in such a way that not all of the required guarantees are met. Such settings are not generally recommended, as far as I remember, though.
...
Yep: http://www.drbd.org/users-guide/re-drbdconf.html check the docs for disk-barrier.

Also, consider DRBD

Posted Jun 1, 2012 21:25 UTC (Fri) by andresfreund (subscriber, #69562) [Link]

The problem is that you cannot use the 2nd server for anything (running backups, reporting queries, ...) that way. Also you have the problem that starting with crash recovery (which is what will happen if you failover to the 2nd server, because pg hasn't been cleanly shut down) can take a long time if you have a big time and allow for some wal to collect for performance reasons (checkpoint_segments).
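A rough estimate of the replay time after failover: the sketch below assumes the default 16 MB WAL segment size and approximates the WAL accumulated since the last checkpoint as `checkpoint_segments` segments. The real amount can be smaller (checkpoints triggered by timeout) or somewhat larger (spread checkpoints), and the replay rate is a made-up figure, so treat the result as an order of magnitude only.

```python
WAL_SEGMENT_MB = 16  # PostgreSQL's default WAL segment size


def estimated_replay_seconds(checkpoint_segments: int,
                             replay_mb_per_s: float) -> float:
    """Approximate crash-recovery time as WAL volume / replay rate."""
    wal_mb = checkpoint_segments * WAL_SEGMENT_MB
    return wal_mb / replay_mb_per_s


# Hypothetical: checkpoint_segments = 64, replay at an assumed 32 MB/s.
print(estimated_replay_seconds(64, 32.0))
```

Replay is often limited by random I/O on the data files rather than by sequential WAL reads, so the effective rate can be far lower than the disk's streaming throughput.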

Also, consider DRBD

Posted Jun 4, 2012 10:08 UTC (Mon) by niner (subscriber, #26151) [Link]

That's not entirely true. If your DRBD device is on top of LVM you can take a snapshot and mount this snapshot even on the secondary. We use such a setup for backups and it works just fine. You could even run PostgreSQL on this snapshot and do your reports from it.

Of course LVM snapshots have their own performance problems, and just using PostgreSQL's native replication might be easier and faster. Still, using DRBD alone is a possibility, and it lets you cover more than just the database.

Also, consider DRBD

Posted Jun 4, 2012 10:36 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

Sure, you can do that or similar things (loads of SANs have that capability). Keeping the secondary pg instance up to date is really expensive in that scenario though. You need to shut down pg, drop the old snapshot, create a new one, and start pg, which will then do recovery. If you have nontrivial amounts of writes, the WAL replay upon startup will take quite some time...
I don't really see any reason to do so these days unless your database is gigantonormous and you cannot afford to have a full copy of the datadir for reporting.
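The refresh cycle described above can be sketched as a list of commands. This is a dry-run helper that only builds the command lines and does not execute anything. The volume group, logical volume names, mount point, and data directory are hypothetical placeholders, and a real setup would need error handling and filesystem-specific mount options.

```python
from typing import List


def refresh_snapshot_commands(vg: str, origin_lv: str, snap_lv: str,
                              snap_size: str, mount_point: str,
                              pgdata: str) -> List[List[str]]:
    """Build the commands to refresh a reporting instance on an LVM snapshot."""
    origin_dev = f"/dev/{vg}/{origin_lv}"
    snap_dev = f"/dev/{vg}/{snap_lv}"
    return [
        ["pg_ctl", "-D", pgdata, "stop", "-m", "fast"],    # shut down pg
        ["umount", mount_point],
        ["lvremove", "-f", snap_dev],                      # drop old snapshot
        ["lvcreate", "--snapshot", "--name", snap_lv,
         "--size", snap_size, origin_dev],                 # create new snapshot
        ["mount", snap_dev, mount_point],
        ["pg_ctl", "-D", pgdata, "start"],                 # start pg; it runs recovery
    ]


# All names below are placeholders for illustration.
for cmd in refresh_snapshot_commands("vg0", "drbd_backing", "pg_report",
                                     "10G", "/mnt/pg_report",
                                     "/mnt/pg_report/data"):
    print(" ".join(cmd))
```

The final `pg_ctl start` is where the cost lies: the snapshot looks like a crashed instance, so every refresh pays for a full WAL replay.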

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds