Why Uber dropped PostgreSQL
The rivalry between database management systems is often somewhat heated, with fans of one system often loudly proclaiming that a competitor "sucks" or similar. And the competition between MySQL and PostgreSQL in the open-source world has certainly been heated at times, which makes a recent discussion of the pros and cons of the two databases rather enlightening. While it involved technical criticism of the design decisions made by both, it lacked heat and instead focused on sober analysis of the differences and their implications.
The transportation company Uber had long used PostgreSQL as the storage back-end for a monolithic Python application, but that changed over the last year or two. Uber switched to using its own Schemaless sharding layer atop MySQL and on July 26 published a blog post by Evan Klitzke that set out to explain why the switch was made. There were a number of reasons behind it, but the main problem the company encountered involved rapid updates to a table with a large number of indexes. PostgreSQL did not handle that workload particularly well.
There were effectively two facets to the problem with the table: changing any indexed field would result in write amplification because each index also needed to be updated and all of those writes need to be replicated across multiple servers. One advantage that MySQL has, Klitzke said, is that it uses a level of indirection in the InnoDB storage engine so that an update to an indexed field does not require updating the unrelated indexes.
There were some other problems as well. Uber encountered a bug in PostgreSQL 9.2 that led to data corruption, which rightly caused a lot of consternation. The bug was fixed quickly, but some corrupted data did end up being replicated, which made things worse. However, the blog post seems to imply that this kind of problem is somehow PostgreSQL-specific and does not really acknowledge that bugs will occur in all database systems (really, all software, of course), including MySQL:
Another problem that Uber encountered was in upgrading to new PostgreSQL releases. The process it used was time-consuming and required quite a bit of downtime, which it could not afford. MySQL supports both binary replication (which is what PostgreSQL uses) and statement-level replication. The latter allows MySQL to be more easily upgraded in place, without significant downtime, Klitzke said.
Joshua D. Drake posted a pointer to the
blog post to the pgsql-hackers mailing list: "It is a very good read
and I encourage our hackers to do so with an open
mind.
" And it seems that's just what they did. While there was
disagreement with some of the statements and implications in the blog post,
there was also acknowledgment that some of the problems are real.
Josh Berkus restated the main
write-amplification problem in a more concrete way. If you have a large
enough
table for indexes to make sense and use that table in JOIN
statements throughout the application, indexing most of the columns may
make sense from a performance standpoint. But if that table is updated
frequently ("500 times per second
"), there is a problem:
The Uber guy is right that InnoDB handles this better as long as you don't touch the primary key (primary key updates in InnoDB are really bad).
This is a common problem case we don't have an answer for yet.
Bruce Momjian amended that slightly: "Or, basically, we don't have an answer to without making something else
worse.
" Tom Lane, though, saw it as
more of an annoyance, rather than "a time-for-a-new-database
kind of problem
". But both Berkus and Robert Haas disagreed; Haas
said that he has "seen multiple cases
where this kind of thing causes a
sufficiently large performance regression that the system just can't
keep up
". Berkus called it
"considerably more than an annoyance for the people who suffer from
it
", but agreed that it is not something that should, by itself,
cause a database switch.
Stephen Frost took issue with some of the
high-level criticisms in the blog post that were quoted by Drake. In
particular, the table-corruption problem is hardly PostgreSQL-specific:
"The
implication that MySQL doesn't have similar bugs is entirely incorrect,
as is the idea that logical replication would avoid data corruption
issues (in practice, it actually tends to be quite a bit worse).
"
In addition, there are ways to make upgrading to newer versions much less
painful:
He also wondered if Uber truly understood the write amplification problem and its implications. That is a theme that was taken up by Markus Winand in a blog post on the switch. In it, he agreed that MySQL might be the best choice for Uber's use case, but felt like there were some missing pieces in the explanation of the company's problems. The source of the problem is that Uber has an update-heavy workload that is evidently updating one or more of the indexed columns; otherwise PostgreSQL already has a solution:
He wondered if all of the indexes were actually needed, but that cannot be determined from the information in Uber's blog post. That post mentions InnoDB's indirection as an advantage, though, and downplays the penalty that comes from that indirection. Winand calls it the "clustered index penalty" and suggested that it can be substantial if queries are made using the secondary keys. The fact that Uber downplayed that penalty makes it appear that most of the queries use the primary index, which doesn't suffer from the penalty.
In the end, Winand concluded—in a somewhat snarky way—that what Uber is
looking for is a key/value store with a SQL front-end. Since InnoDB is a
"pretty solid and popular key/value store
" and that MySQL (and
MariaDB) provide SQL on top of it, it makes sense that it works well for
the company.
Simon Riggs also addressed Klitzke's post with a blog
post of his own. He welcomed Uber raising the points that it did, but
was concerned that "a number of important technical points are either
not correct or not wholly correct because they overlook many optimizations
in PostgreSQL that were added specifically to address the cases
discussed
". He noted the penalty for using indirect indexing on
secondary keys, as MySQL does, but also pointed out that PostgreSQL could
use indirect indexes some day as well:
The current architecture of PostgreSQL is that all index types are “direct”, whereas in InnoDB primary indexes are “direct” and secondary indexes “indirect”. There is no inherent architectural limitation that prevents PostgreSQL from also using indirect indexes, though it is true that has not been added yet.
Riggs also said that statement-level replication has performance and corner-case problems that make it unsuitable for PostgreSQL. It does save bandwidth, as Klitzke pointed out, but can lead to hard-to-diagnose replication problems.
The discussion on the mailing list was largely even-tempered and focused on
the problems at hand, much like all three of the blog posts mentioned above. Some solutions were considered and might become
PostgreSQL features down the road. In the end, publicly losing a big user
like Uber is perhaps a little unfortunate—and it does seem like there may
have been other factors in play, such as new pro-MySQL CTO—but it is in no
way a condemnation of PostgreSQL as a whole. In fact, as Berkus put it: "Even if they switched off, it's
still a nice
testimonial that they once ran their entire worldwide fleet off a single
Postgres cluster.
"
Overall, the "incident" demonstrates a sensible approach to criticism of a project: find the pieces that are truly problems and look at how to solve them. In the case of Uber, it may well be that it is best served by MySQL, but it is also likely that others with different needs will see things differently. Having several open-source software choices means that everyone can choose the right tool for their job.
Posted Aug 4, 2016 9:57 UTC (Thu)
by epa (subscriber, #39769)
[Link] (5 responses)
This suggests that logical replication would be a better fit for Postgres than physical replication, since with logical replication each replica can provide its own MVCC for read queries.
Posted Aug 4, 2016 12:30 UTC (Thu)
by Pinaraf (subscriber, #33153)
[Link]
The Uber article sadly lacks some crucial information, and fails to mention hot updates that prevent rewriting indexes, that feedback setting... With more information, it would be possible to better understand the issues they have, but it would be complicated for them to show their database schema I suppose.
Posted Aug 4, 2016 13:10 UTC (Thu)
by jberkus (guest, #55561)
[Link]
Note that the PostgreSQL community also agrees that having better logical replication (row-based) would also be a good thing, hence the "BDR" project: http://bdr-project.org/docs/stable/index.html
There are benefits to logical replication, but also drawbacks. For example, logical replication imposes a much higher query load on the replicas, making them less able to load-balance read workloads.
Posted Aug 4, 2016 15:51 UTC (Thu)
by blitzkrieg3 (guest, #57873)
[Link] (1 responses)
Posted Aug 5, 2016 20:01 UTC (Fri)
by flewellyn (subscriber, #5047)
[Link]
Posted Aug 11, 2016 7:38 UTC (Thu)
by ringerc (subscriber, #3071)
[Link]
Ideally we'd instead store blocks still needed on a replica, but no longer needed on the master, out-of-line using a copy-on-write mechanism of some form. Problem is that it'd have to do indexes as well as the heap, and when we do a tid lookup from an index we'd have to look up both the main heap block and some possibly large number of block copies. It'd get complex fast, and complex means buggy.
It'd also slow down queries on the replica, make replica change apply slower, etc.
Everything is trade-offs, and I think Uber's article misses that somewhat.
Posted Aug 5, 2016 0:45 UTC (Fri)
by brong (guest, #87268)
[Link] (8 responses)
I've seen this claim a few times around by people who clearly didn't read or understand the Uber article enough to understand that a whole class of corruptions are possible with the Postgres method of replication (raw binary log shipping) that are simply not possible in the same way with either row based or statement based structured replication. Sure if the bug is deterministic then replaying the same transactions on the replica will cause the same corruption - but if there's a bug that's dependent on particular server state that corrupts an underlying data structure - it's very likely that the replicas won't have that same on-disk corruption when they play a statement-based replication stream - so you can fail over to a replica and keep going. With Postgres shipping the raw data structures - if they corrupt on the master, that corruption goes straight to all the replicas without an additional sanity check.
Posted Aug 5, 2016 11:21 UTC (Fri)
by niner (subscriber, #26151)
[Link] (6 responses)
Posted Aug 5, 2016 12:10 UTC (Fri)
by brong (guest, #87268)
[Link] (4 responses)
can you please enumerate the sort of corruptions that occur with statement based replication?
The only sort I can think of are cases where the transactions get re-ordered in the statement log compared to the order they were actually applied on the master due to concurrency, and hence the replica falls out of sync.
Or cases where you flat out allow the two ends to be out of sync by manually fiddling replication log position so that you skip transactions. You can't really call that a bug in statement based replication though.
Posted Aug 5, 2016 15:11 UTC (Fri)
by paulj (subscriber, #341)
[Link] (1 responses)
With the low-level binary log replication, bugs that lead to corruption can replicate.
With the logical level replication, bugs that lead to logical level corruption can also cause inconsistent state. E.g., an update doesn't get applied to slaves because it isn't accepted, which could affect application consistency. Bugs at the binary log level may not replicate of themselves, but could cause a logical level replication to fail to replicate and cause inconsistent state.
Isn't it the case that the logical layer replication system has _two_ layers at which bugs can strike and cause significant problems? You now have two layers that need to be robust? And bugs in the lower layer can still take down the upper layer?
Posted Aug 5, 2016 21:58 UTC (Fri)
by brong (guest, #87268)
[Link]
If your low level data structures are corrupted - better have a good fsck and/or good backups, because you have have no replica with consistent state any more.
Posted Aug 7, 2016 16:43 UTC (Sun)
by krakensden (subscriber, #72039)
[Link]
Posted Aug 11, 2016 7:50 UTC (Thu)
by ringerc (subscriber, #3071)
[Link]
MySQL works around this somewhat by special-casing some functions, like now(). It evaluates them on the master and stores the results in the binlog, then ensures the invocations on the replica(s) return the same results as the master.
PgPool-II for PostgreSQL does something similar in statement based replication mode.
Clever, but solves only narrow cases. For example, in MySQL SYSDATE() still doesn't work safely. So you have to code very carefully to avoid breakage. (See https://dev.mysql.com/doc/refman/5.7/en/replication-featu...) .
By contrast, PostgreSQL's block-level replication leaves the replica an identical copy.
That's why in practice the most practical MySQL replication option is row-based replication or hybrid row/statement based replication. Many people who are talking about "statement based" replication here are really thinking of row-based replication, or the MIXED replication mode that MySQL can use to hybridize the two. Rather cleverly, I must say. ( https://dev.mysql.com/doc/refman/5.7/en/replication-forma..., https://dev.mysql.com/doc/refman/5.7/en/binary-log-mixed.... ).
That's what I'm involved in working on for PostgreSQL too, at 2ndQuadrant, in the form of BDR and pglogical. There's ongoing work to get this into PostgreSQL core. Though we're not planning on any sort of mixed replication mode at this point.
Posted Aug 7, 2016 3:54 UTC (Sun)
by giraffedata (guest, #1954)
[Link]
But are corruptions of that class as dangerous?
I take the complaint to be that with the WAL-based replication, a single trigger of a bug can cost you the whole cluster. But with logical replication, for all it's opportunities to fail, the most you will lose is one replica, and at worst you'll have to blow away that replica and replace it.
Is there a class of bug specific to MySQL that corrupts the entire cluster at once?
Posted Aug 12, 2016 10:04 UTC (Fri)
by moltonel (guest, #45207)
[Link]
Have a re-read of the article: the bug that affected Uber was not trickling from the master to all the replicas. Each replica had corruption on different rows. The mailing list thread also mentions that misconception.
While each replication strategy bring their own class of potential bugs (with statement-based replication generally seen as the most fragile kind), this particular bug was apparently not made more likely by Uber/PG's choice of replication architecture, and MySQL isn't shielded from that kind of bug either.
Replication and MVCC
Replication and MVCC
https://www.postgresql.org/docs/current/static/hot-standb... => hot_standby_feedback.
Replication and MVCC
Replication and MVCC
Replication and MVCC
Replication and MVCC
Why Uber dropped PostgreSQL
However, the blog post seems to imply that this kind of problem is somehow PostgreSQL-specific and does not really acknowledge that bugs will occur in all database systems (really, all software, of course), including MySQL
Why Uber dropped PostgreSQL
Why Uber dropped PostgreSQL
Why Uber dropped PostgreSQL
Why Uber dropped PostgreSQL
Why Uber dropped PostgreSQL
Why Uber dropped PostgreSQL
Why Uber dropped PostgreSQL
At the same time, logical replication like MySQL does bring a whole class of corruptions that are simply not possible in the same way with Postgres' WAL based replication.
Why Uber dropped PostgreSQL