Why Uber dropped PostgreSQL
Posted Aug 5, 2016 0:45 UTC (Fri) by brong (guest, #87268)
Parent article: Why Uber dropped PostgreSQL
However, the blog post seems to imply that this kind of problem is somehow PostgreSQL-specific and does not really acknowledge that bugs will occur in all database systems (really, all software, of course), including MySQL
I've seen this claim a few times from people who clearly didn't read the Uber article closely enough to understand that a whole class of corruptions is possible with the Postgres method of replication (raw binary log shipping) that is simply not possible in the same way with either row-based or statement-based structured replication.
Sure, if the bug is deterministic then replaying the same transactions on the replica will cause the same corruption. But if a bug depends on particular server state and corrupts an underlying data structure, it's very likely that the replicas won't have that same on-disk corruption when they play a statement-based replication stream, so you can fail over to a replica and keep going. With Postgres shipping the raw data structures, if they corrupt on the master, that corruption goes straight to all the replicas without an additional sanity check.
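As a rough illustration of that argument, here is a toy Python model. It is only a sketch under loud assumptions: the dicts stand in for on-disk structures, `apply_update` and its `trigger_bug` flag are hypothetical, and nothing here reflects how PostgreSQL's WAL or MySQL's binlog actually work internally.

```python
def apply_update(store, key, value, trigger_bug=False):
    """Apply one logical update to a node's storage.

    trigger_bug simulates a state-dependent bug that damages the stored
    representation on the node where it happens to fire (hypothetical).
    """
    store[key] = value
    if trigger_bug:
        store[key] = "<garbage>"


primary = {}
physical_replica = {}
logical_replica = {}

# The update runs on the primary, where the state-dependent bug fires.
apply_update(primary, "acct:1", 100, trigger_bug=True)

# Physical (binary) replication: copy the primary's resulting storage as-is.
physical_replica.update(primary)

# Logical replication: replay the update itself on a node in a different
# state, so the state-dependent bug does not fire there.
apply_update(logical_replica, "acct:1", 100, trigger_bug=False)

print(primary, physical_replica, logical_replica)
```

In this model the physical replica inherits the damaged value while the logical replica ends up with the intended one. A deterministic bug (one that fires on every node) would damage both, which is the caveat in the first sentence above.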
Posted Aug 5, 2016 11:21 UTC (Fri)
by niner (subscriber, #26151)
[Link] (6 responses)
Posted Aug 5, 2016 12:10 UTC (Fri)
by brong (guest, #87268)
[Link] (4 responses)
Can you please enumerate the sorts of corruption that occur with statement-based replication?
The only ones I can think of are cases where the transactions get re-ordered in the statement log compared to the order in which they were actually applied on the master due to concurrency, and hence the replica falls out of sync.
Or cases where you flat out allow the two ends to be out of sync by manually fiddling with the replication log position so that you skip transactions. You can't really call that a bug in statement-based replication though.
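The re-ordering case can be sketched in a few lines of Python. This is a hypothetical toy, not a model of any real binlog: the two "statements" are plain functions chosen because they do not commute.

```python
def add_ten(x):
    # Stand-in for "UPDATE t SET x = x + 10"
    return x + 10


def double(x):
    # Stand-in for "UPDATE t SET x = x * 2"
    return x * 2


def replay(statements, x):
    """Apply statements in the given order to a starting value."""
    for stmt in statements:
        x = stmt(x)
    return x


# Order actually applied on the master.
master_value = replay([add_ten, double], 1)

# Same statements, but written to the statement log in the other order.
replica_value = replay([double, add_ten], 1)

print(master_value, replica_value)
```

Starting from 1, the master ends at (1 + 10) * 2 = 22 while the replica ends at 1 * 2 + 10 = 12, so the two have silently diverged even though every statement was replayed.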
Posted Aug 5, 2016 15:11 UTC (Fri)
by paulj (subscriber, #341)
[Link] (1 responses)
With the low-level binary log replication, bugs that lead to corruption can replicate.
With the logical level replication, bugs that lead to logical level corruption can also cause inconsistent state. E.g., an update doesn't get applied to slaves because it isn't accepted, which could affect application consistency. Bugs at the binary log level may not replicate of themselves, but could cause a logical level replication to fail to replicate and cause inconsistent state.
Isn't it the case that the logical layer replication system has _two_ layers at which bugs can strike and cause significant problems? You now have two layers that need to be robust? And bugs in the lower layer can still take down the upper layer?
Posted Aug 5, 2016 21:58 UTC (Fri)
by brong (guest, #87268)
[Link]
If your low-level data structures are corrupted, you'd better have a good fsck and/or good backups, because you have no replica with consistent state any more.
Posted Aug 7, 2016 16:43 UTC (Sun)
by krakensden (subscriber, #72039)
[Link]
Posted Aug 11, 2016 7:50 UTC (Thu)
by ringerc (subscriber, #3071)
[Link]
MySQL works around this somewhat by special-casing some functions, like now(). It evaluates them on the master and stores the results in the binlog, then ensures the invocations on the replica(s) return the same results as the master.
PgPool-II for PostgreSQL does something similar in statement based replication mode.
Clever, but it solves only narrow cases. For example, in MySQL SYSDATE() still doesn't work safely, so you have to code very carefully to avoid breakage. (See https://dev.mysql.com/doc/refman/5.7/en/replication-featu...)
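To make the now() versus SYSDATE() difference concrete, here is a minimal Python sketch. The `Server` class, its methods, and the integer "clocks" are all hypothetical stand-ins; the only behaviour borrowed from MySQL is that now() is pinned to a per-statement timestamp carried in the log, while SYSDATE() reads the clock at the moment it runs.

```python
import itertools


class Server:
    def __init__(self, clock):
        self._clock = clock      # iterator standing in for the wall clock
        self._stmt_time = None   # timestamp pinned for the current statement
        self.rows = []

    def now(self):
        # Returns the pinned statement timestamp.
        return self._stmt_time

    def sysdate(self):
        # Reads the local clock at call time, ignoring the pinned timestamp.
        return next(self._clock)

    def execute(self, func_name, pinned_time=None):
        """Run one statement that stores func_name() in a row."""
        if pinned_time is None:
            pinned_time = next(self._clock)
        self._stmt_time = pinned_time
        self.rows.append(getattr(self, func_name)())
        return pinned_time  # what gets written to the statement log


master = Server(itertools.count(1000))
replica = Server(itertools.count(5000))  # replays later, so its clock differs

# Master executes and logs (function name, statement timestamp) pairs.
log = [(name, master.execute(name)) for name in ("now", "sysdate")]

# Replica replays each statement with the logged timestamp pinned.
for name, ts in log:
    replica.execute(name, pinned_time=ts)

print(master.rows, replica.rows)
```

The now() row matches on both sides because the replica reuses the logged timestamp; the sysdate() row differs because it came from each server's own clock.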
By contrast, PostgreSQL's block-level replication leaves the replica an identical copy.
That's why, in practice, the most workable MySQL replication option is row-based replication or hybrid row/statement-based replication. Many people who are talking about "statement based" replication here are really thinking of row-based replication, or the MIXED replication mode that MySQL can use to hybridize the two. Rather cleverly, I must say. ( https://dev.mysql.com/doc/refman/5.7/en/replication-forma..., https://dev.mysql.com/doc/refman/5.7/en/binary-log-mixed.... ).
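The general idea behind a hybrid mode can be sketched as follows. This is an assumption-laden illustration only: `UNSAFE_FUNCTIONS`, `choose_event`, and the event dicts are invented for the example and are not MySQL's actual rules or binlog format.

```python
# Illustrative set only; not MySQL's real list of unsafe functions.
UNSAFE_FUNCTIONS = {"SYSDATE", "UUID"}


def choose_event(statement, functions_used, changed_rows):
    """Pick what to write to the replication log for one statement.

    If the statement uses a function that may evaluate differently on a
    replica, log the resulting rows; otherwise log the statement text.
    """
    if UNSAFE_FUNCTIONS & set(functions_used):
        return {"format": "row", "rows": changed_rows}
    return {"format": "statement", "sql": statement}


safe = choose_event(
    "UPDATE t SET n = n + 1 WHERE id = 7", [], [{"id": 7, "n": 4}]
)
unsafe = choose_event(
    "UPDATE t SET ts = SYSDATE() WHERE id = 7",
    ["SYSDATE"],
    [{"id": 7, "ts": 1002}],
)

print(safe["format"], unsafe["format"])
```

Logging the rows for the unsafe statement means the replica stores the value the master computed rather than re-evaluating the function itself.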
That's what I'm involved in working on for PostgreSQL too, at 2ndQuadrant, in the form of BDR and pglogical. There's ongoing work to get this into PostgreSQL core. Though we're not planning on any sort of mixed replication mode at this point.
Posted Aug 7, 2016 3:54 UTC (Sun)
by giraffedata (guest, #1954)
[Link]
But are corruptions of that class as dangerous?
I take the complaint to be that with WAL-based replication, a single trigger of a bug can cost you the whole cluster. But with logical replication, for all its opportunities to fail, the most you will lose is one replica, and at worst you'll have to blow away that replica and replace it.
Is there a class of bug specific to MySQL that corrupts the entire cluster at once?
Posted Aug 12, 2016 10:04 UTC (Fri)
by moltonel (guest, #45207)
[Link]
Have a re-read of the article: the bug that affected Uber was not trickling from the master to all the replicas. Each replica had corruption on different rows. The mailing list thread also mentions that misconception.
While each replication strategy brings its own class of potential bugs (with statement-based replication generally seen as the most fragile kind), this particular bug was apparently not made more likely by Uber/PG's choice of replication architecture, and MySQL isn't shielded from that kind of bug either.
At the same time, logical replication as MySQL does it brings a whole class of corruptions that are simply not possible in the same way with Postgres' WAL-based replication.