
Why Uber dropped PostgreSQL

By Jake Edge
August 3, 2016

The rivalry between database management systems is often somewhat heated, with fans of one system loudly proclaiming that a competitor "sucks" or similar. And the competition between MySQL and PostgreSQL in the open-source world has certainly been heated at times, which makes a recent discussion of the pros and cons of the two databases rather enlightening. While it involved technical criticism of the design decisions made by both, it lacked heat and instead focused on sober analysis of the differences and their implications.

The transportation company Uber had long used PostgreSQL as the storage back-end for a monolithic Python application, but that changed over the last year or two. Uber switched to using its own Schemaless sharding layer atop MySQL and on July 26 published a blog post by Evan Klitzke that set out to explain why the switch was made. There were a number of reasons behind it, but the main problem the company encountered involved rapid updates to a table with a large number of indexes. PostgreSQL did not handle that workload particularly well.

There were effectively two facets to the problem with the table: changing any indexed field resulted in write amplification, because every index on the table needed to be updated as well, and all of those extra writes then had to be replicated across multiple servers. One advantage that MySQL has, Klitzke said, is that it uses a level of indirection in the InnoDB storage engine, so that an update to an indexed field does not require updating the unrelated indexes.
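
As a rough sketch of that pattern (the schema and names here are hypothetical, not taken from Uber's systems), consider a heavily indexed table and an update that touches indexed columns:

    -- Hypothetical heavily indexed table (not Uber's actual schema).
    CREATE TABLE trips (
        id         bigserial PRIMARY KEY,
        driver_id  bigint NOT NULL,
        rider_id   bigint NOT NULL,
        city_id    integer NOT NULL,
        status     text NOT NULL,
        fare       numeric,            -- deliberately left unindexed
        updated_at timestamptz NOT NULL
    );
    CREATE INDEX ON trips (driver_id);
    CREATE INDEX ON trips (rider_id);
    CREATE INDEX ON trips (city_id);
    CREATE INDEX ON trips (status);
    CREATE INDEX ON trips (updated_at);

    -- In PostgreSQL this writes a new row version, and because indexed
    -- columns change, every index on the table gets a new entry pointing
    -- at that new version; all of that churn then goes into the WAL and
    -- out to every replica.
    UPDATE trips SET status = 'completed', updated_at = now() WHERE id = 42;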

There were some other problems as well. Uber encountered a bug in PostgreSQL 9.2 that led to data corruption, which rightly caused a lot of consternation. The bug was fixed quickly, but some corrupted data did end up being replicated, which made things worse. However, the blog post seems to imply that this kind of problem is somehow PostgreSQL-specific and does not really acknowledge that bugs will occur in all database systems (really, all software, of course), including MySQL:

The bug we ran into only affected certain releases of Postgres 9.2 and has been fixed for a long time now. However, we still find it worrisome that this class of bug can happen at all. A new version of Postgres could be released at any time that has a bug of this nature, and because of the way replication works, this issue has the potential to spread into all of the databases in a replication hierarchy.

Another problem that Uber encountered was in upgrading to new PostgreSQL releases. The process it used was time-consuming and required quite a bit of downtime, which it could not afford. PostgreSQL replicates by shipping its block-level write-ahead log, while MySQL replicates logically, with both statement-level and row-based formats available. Statement-level replication allows MySQL to be more easily upgraded in place, without significant downtime, Klitzke said.

Joshua D. Drake posted a pointer to the blog post to the pgsql-hackers mailing list: "It is a very good read and I encourage our hackers to do so with an open mind." And it seems that's just what they did. While there was disagreement with some of the statements and implications in the blog post, there was also acknowledgment that some of the problems are real.

Josh Berkus restated the main write-amplification problem in a more concrete way. If a table is large enough for indexes to be worthwhile, and it is used in JOIN statements throughout the application, indexing most of its columns may make sense from a performance standpoint. But if that table is updated frequently ("500 times per second"), there is a problem:

That's a recipe for runaway table bloat; VACUUM can't do much because there's always some minutes-old transaction hanging around (and SNAPSHOT TOO OLD doesn't really help, we're talking about minutes here), and because of all of the indexes HOT isn't effective. Removing the indexes is equally painful because it means less efficient JOINs.

The Uber guy is right that InnoDB handles this better as long as you don't touch the primary key (primary key updates in InnoDB are really bad).

This is a common problem case we don't have an answer for yet.

Bruce Momjian amended that slightly: "Or, basically, we don't have an answer to without making something else worse." Tom Lane, though, saw it as more of an annoyance, rather than "a time-for-a-new-database kind of problem". But both Berkus and Robert Haas disagreed; Haas said that he has "seen multiple cases where this kind of thing causes a sufficiently large performance regression that the system just can't keep up". Berkus called it "considerably more than an annoyance for the people who suffer from it", but agreed that it is not something that should, by itself, cause a database switch.

Stephen Frost took issue with some of the high-level criticisms in the blog post that were quoted by Drake. In particular, the table-corruption problem is hardly PostgreSQL-specific: "The implication that MySQL doesn't have similar bugs is entirely incorrect, as is the idea that logical replication would avoid data corruption issues (in practice, it actually tends to be quite a bit worse)." In addition, there are ways to make upgrading to newer versions much less painful:

Their specific issue with these upgrades was solved, years ago, by me (and it wasn't particularly difficult to do...) through the use of pg_upgrade's --link option and rsync's ability to construct hard link trees. Making major release upgrades easier with less downtime is certainly a good goal, but there's been a solution to the specific issue they had here for quite a while.

He also wondered if Uber truly understood the write amplification problem and its implications. That is a theme that was taken up by Markus Winand in a blog post on the switch. In it, he agreed that MySQL might be the best choice for Uber's use case, but felt that there were some missing pieces in the explanation of the company's problems. The source of the problem is that Uber has an update-heavy workload that evidently updates one or more of the indexed columns; otherwise PostgreSQL already has a solution:

However, there is a little bit more speculation possible based upon something that is not written in Uber's article: The article doesn't mention PostgreSQL Heap-Only-Tuples (HOT). From the PostgreSQL source, HOT is useful for the special case "where a tuple is repeatedly updated in ways that do not change its indexed columns." In that case, PostgreSQL is able to do the update without touching any index if the new row-version can be stored in the same page as the previous version. The latter condition can be tuned using the fillfactor setting. Assuming Uber's Engineering is aware of this means that HOT is no solution to their problem because the updates they run at high frequency affect at least one indexed column.
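
For readers unfamiliar with the knob mentioned in that quote, fillfactor is an ordinary storage parameter; a minimal sketch (reusing the hypothetical table from the earlier sketch) of leaving room in each page so that updates to non-indexed columns can qualify as HOT:

    -- Reserve 40% free space in each heap page so that a new row version
    -- can often be placed on the same page as the old one (this affects
    -- newly written pages; existing data would need a rewrite to repack).
    ALTER TABLE trips SET (fillfactor = 60);

    -- An update that touches only non-indexed columns (here: fare) can
    -- then be a heap-only tuple (HOT) update, creating no new index entries.
    UPDATE trips SET fare = 23.50 WHERE id = 42;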

He wondered if all of the indexes were actually needed, but that cannot be determined from the information in Uber's blog post. That post mentions InnoDB's indirection as an advantage, though, and downplays the penalty that comes from that indirection. Winand calls it the "clustered index penalty" and suggested that it can be substantial if queries are made using the secondary keys. The fact that Uber downplayed that penalty makes it appear that most of the queries use the primary index, which doesn't suffer from the penalty.
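
To make that penalty concrete (again using the hypothetical schema from the earlier sketch): in InnoDB the table itself is a B-tree ordered by the primary key, and secondary indexes store primary-key values rather than physical row locations, so a secondary-key lookup traverses two B-trees:

    -- InnoDB (MySQL): the driver_id index yields primary-key values, so
    -- this query descends the secondary index and then the clustered
    -- primary-key B-tree to fetch each row -- two traversals per lookup.
    SELECT * FROM trips WHERE driver_id = 1234;

    -- A primary-key lookup needs only the single clustered B-tree descent,
    -- which is why a workload dominated by such lookups barely notices
    -- the penalty.
    SELECT * FROM trips WHERE id = 42;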

In the end, Winand concluded—in a somewhat snarky way—that what Uber is looking for is a key/value store with a SQL front-end. Since InnoDB is a "pretty solid and popular key/value store" and MySQL (and MariaDB) provides SQL on top of it, it makes sense that it works well for the company.

Simon Riggs also addressed Klitzke's post with a blog post of his own. He welcomed Uber raising the points that it did, but was concerned that "a number of important technical points are either not correct or not wholly correct because they overlook many optimizations in PostgreSQL that were added specifically to address the cases discussed". He noted the penalty for using indirect indexing on secondary keys, as MySQL does, but also pointed out that PostgreSQL could use indirect indexes some day as well:

Thus, it is possible to construct cases in which PostgreSQL consistently beats InnoDB, or vice versa. In the “common case” PostgreSQL beats InnoDB on reads and is roughly equal on writes for btree access. What we should note is that PostgreSQL has the widest selection of index types of any database system and this is an area of strength, not weakness.

The current architecture of PostgreSQL is that all index types are “direct”, whereas in InnoDB primary indexes are “direct” and secondary indexes “indirect”. There is no inherent architectural limitation that prevents PostgreSQL from also using indirect indexes, though it is true that has not been added yet.

Riggs also said that statement-level replication has performance and corner-case problems that make it unsuitable for PostgreSQL. It does save bandwidth, as Klitzke pointed out, but can lead to hard-to-diagnose replication problems.

The discussion on the mailing list was largely even-tempered and focused on the problems at hand, much like all three of the blog posts mentioned above. Some solutions were considered and might become PostgreSQL features down the road. In the end, publicly losing a big user like Uber is perhaps a little unfortunate—and it does seem like there may have been other factors in play, such as a new pro-MySQL CTO—but it is in no way a condemnation of PostgreSQL as a whole. In fact, as Berkus put it: "Even if they switched off, it's still a nice testimonial that they once ran their entire worldwide fleet off a single Postgres cluster."

Overall, the "incident" demonstrates a sensible approach to criticism of a project: find the pieces that are truly problems and look at how to solve them. In the case of Uber, it may well be that it is best served by MySQL, but it is also likely that others with different needs will see things differently. Having several open-source software choices means that everyone can choose the right tool for their job.




Replication and MVCC

Posted Aug 4, 2016 9:57 UTC (Thu) by epa (subscriber, #39769) (5 responses)

What I found most surprising is that Postgres replication doesn't play nicely with MVCC (multi-version concurrency control). On a single database server you can open a read transaction and not block any writers, as old versions of the data are kept around for you if necessary. But with replication the master doesn't know what read transactions may be open on the replicas, so it has to be conservative and also impose some time limit on transactions. (That's my limited understanding.)

This suggests that logical replication would be a better fit for Postgres than physical replication, since with logical replication each replica can provide its own MVCC for read queries.

Replication and MVCC

Posted Aug 4, 2016 12:30 UTC (Thu) by Pinaraf (subscriber, #33153)

That's one of the issues with the Uber article: there is a PostgreSQL setting preventing that issue:
https://www.postgresql.org/docs/current/static/hot-standb... => hot_standby_feedback.

The Uber article sadly lacks some crucial information, and fails to mention hot updates that prevent rewriting indexes, that feedback setting... With more information, it would be possible to better understand the issues they have, but it would be complicated for them to show their database schema I suppose.
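
For reference, the setting mentioned above is a single line on the standby (a minimal sketch; as later comments note, the tradeoff is extra bloat on the master):

    # In the standby's postgresql.conf: report the oldest snapshot still in
    # use on the replica back to the master, so that VACUUM does not remove
    # row versions which running queries on the standby still need.
    hot_standby_feedback = on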

Replication and MVCC

Posted Aug 4, 2016 13:10 UTC (Thu) by jberkus (guest, #55561)

Well, their issue with "query cancel", which is what this particular problem is called, was partly that they were on 9.2, where hot_standby_feedback didn't work as well as it does now. For PostgreSQL today, there's an effective tradeoff, depending on whether you care more about avoiding query cancel or avoiding load on the master.

Note that the PostgreSQL community also agrees that having better logical replication (row-based) would also be a good thing, hence the "BDR" project: http://bdr-project.org/docs/stable/index.html

There are benefits to logical replication, but also drawbacks. For example, logical replication imposes a much higher query load on the replicas, making them less able to load-balance read workloads.

Replication and MVCC

Posted Aug 4, 2016 15:51 UTC (Thu) by blitzkrieg3 (guest, #57873) (1 response)

This seems like a serious limitation. They could easily design a protocol to fix this, e.g., the master can ask all replicas to disallow transactions against a given version and only proceed when every replica gives a go-ahead.

Replication and MVCC

Posted Aug 5, 2016 20:01 UTC (Fri) by flewellyn (subscriber, #5047)

Read the other replies on this thread, and you'll see that they DID design a protocol to fix this.

Replication and MVCC

Posted Aug 11, 2016 7:38 UTC (Thu) by ringerc (subscriber, #3071)

In fairness, using hot_standby_feedback does increase bloat on the master and therefore impacts master performance, especially in situations like theirs with high-churn tables.

Ideally we'd instead store blocks still needed on a replica, but no longer needed on the master, out-of-line using a copy-on-write mechanism of some form. Problem is that it'd have to do indexes as well as the heap, and when we do a tid lookup from an index we'd have to look up both the main heap block and some possibly large number of block copies. It'd get complex fast, and complex means buggy.

It'd also slow down queries on the replica, make replica change apply slower, etc.

Everything is trade-offs, and I think Uber's article misses that somewhat.

Why Uber dropped PostgreSQL

Posted Aug 5, 2016 0:45 UTC (Fri) by brong (guest, #87268) (8 responses)

However, the blog post seems to imply that this kind of problem is somehow PostgreSQL-specific and does not really acknowledge that bugs will occur in all database systems (really, all software, of course), including MySQL

I've seen this claim a few times from people who clearly didn't read or understand the Uber article enough to see that a whole class of corruptions is possible with the Postgres method of replication (raw binary log shipping) that is simply not possible in the same way with either row-based or statement-based structured replication.

Sure if the bug is deterministic then replaying the same transactions on the replica will cause the same corruption - but if there's a bug that's dependent on particular server state that corrupts an underlying data structure - it's very likely that the replicas won't have that same on-disk corruption when they play a statement-based replication stream - so you can fail over to a replica and keep going. With Postgres shipping the raw data structures - if they corrupt on the master, that corruption goes straight to all the replicas without an additional sanity check.

Why Uber dropped PostgreSQL

Posted Aug 5, 2016 11:21 UTC (Fri) by niner (subscriber, #26151) (6 responses)

At the same time, logical replication like MySQL does brings a whole class of corruptions that are simply not possible in the same way with Postgres' WAL based replication. So where does that leave us?

Why Uber dropped PostgreSQL

Posted Aug 5, 2016 12:10 UTC (Fri) by brong (guest, #87268) (4 responses)

[citation needed]

can you please enumerate the sort of corruptions that occur with statement based replication?

The only sort I can think of are cases where the transactions get re-ordered in the statement log compared to the order they were actually applied on the master due to concurrency, and hence the replica falls out of sync.

Or cases where you flat out allow the two ends to be out of sync by manually fiddling replication log position so that you skip transactions. You can't really call that a bug in statement based replication though.

Why Uber dropped PostgreSQL

Posted Aug 5, 2016 15:11 UTC (Fri) by paulj (subscriber, #341) (1 response)

Well, we're talking about bugs. Anything is possible, right?

With the low-level binary log replication, bugs that lead to corruption can replicate.

With the logical level replication, bugs that lead to logical level corruption can also cause inconsistent state. E.g., an update doesn't get applied to slaves because it isn't accepted, which could affect application consistency. Bugs at the binary log level may not replicate of themselves, but could cause a logical level replication to fail to replicate and cause inconsistent state.

Isn't it the case that the logical layer replication system has _two_ layers at which bugs can strike and cause significant problems? You now have two layers that need to be robust? And bugs in the lower layer can still take down the upper layer?

Why Uber dropped PostgreSQL

Posted Aug 5, 2016 21:58 UTC (Fri) by brong (guest, #87268)

If the response to a logical update failing to apply is rejecting the update, then you know that your replication is broken, and you haven't lost anything except the most recent changes - and you can skip that update and apply something manually to fix it while you fail over to a replica and bring it as close to up-to-date as possible.

If your low level data structures are corrupted - better have a good fsck and/or good backups, because you have no replica with a consistent state any more.

Why Uber dropped PostgreSQL

Posted Aug 7, 2016 16:43 UTC (Sun) by krakensden (subscriber, #72039)

MySQL has a nice list in its documentation: statement-based replication causes corruption if you use nondeterministic functions like now() or random(). If everyone in your org is aware of this, things can work, but I have definitely seen it not work.
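
A minimal sketch of that hazard (hypothetical table): if the replica naively re-executes the statement, volatile functions are evaluated a second time and can yield different values than they did on the master (MySQL mitigates some of these cases by logging the master's values, as a later comment explains):

    -- Under naive statement-based replication the replica re-runs this
    -- INSERT itself, so now() and random() are re-evaluated there and the
    -- stored values can silently diverge from the master's.
    INSERT INTO audit_log (event, created_at, sample)
    VALUES ('login', now(), random());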

Why Uber dropped PostgreSQL

Posted Aug 11, 2016 7:50 UTC (Thu) by ringerc (subscriber, #3071)

Simple statement-based replication is overwhelmingly flawed. Most importantly, it's completely broken with respect to "volatile" functions, sequence generation, etc. It's utterly hopeless. It can produce different results to what it did on the master in concurrent execution. AUTO_INCREMENT, NOW() and SYSDATE() etc would be very broken.

MySQL works around this somewhat by special-casing some functions, like now(). It evaluates them on the master and stores the results in the binlog, then ensures the invocations on the replica(s) return the same results as the master.

PgPool-II for PostgreSQL does something similar in statement based replication mode.

Clever, but solves only narrow cases. For example, in MySQL SYSDATE() still doesn't work safely. So you have to code very carefully to avoid breakage. (See https://dev.mysql.com/doc/refman/5.7/en/replication-featu...).

By contrast, PostgreSQL's block-level replication leaves the replica an identical copy.

That's why the most practical MySQL replication option in practice is row-based replication or hybrid row/statement-based replication. Many people who are talking about "statement based" replication here are really thinking of row-based replication, or the MIXED replication mode that MySQL can use to hybridize the two. Rather cleverly, I must say. ( https://dev.mysql.com/doc/refman/5.7/en/replication-forma..., https://dev.mysql.com/doc/refman/5.7/en/binary-log-mixed.... ).

That's what I'm involved in working on for PostgreSQL too, at 2ndQuadrant, in the form of BDR and pglogical. There's ongoing work to get this into PostgreSQL core. Though we're not planning on any sort of mixed replication mode at this point.
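
For reference, the MySQL replication format discussed above is selected with the binlog_format variable; a minimal sketch, assuming a server where the global setting may be changed:

    -- MySQL: switch from pure statement-based replication to row-based, or
    -- to the hybrid mode that falls back to row events for unsafe statements.
    SET GLOBAL binlog_format = 'ROW';     -- or 'MIXED'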

Why Uber dropped PostgreSQL

Posted Aug 7, 2016 3:54 UTC (Sun) by giraffedata (guest, #1954)

At the same time, logical replication like MySQL does bring a whole class of corruptions that are simply not possible in the same way with Postgres' WAL based replication.

But are corruptions of that class as dangerous?

I take the complaint to be that with the WAL-based replication, a single trigger of a bug can cost you the whole cluster. But with logical replication, for all its opportunities to fail, the most you will lose is one replica, and at worst you'll have to blow away that replica and replace it.

Is there a class of bug specific to MySQL that corrupts the entire cluster at once?

Why Uber dropped PostgreSQL

Posted Aug 12, 2016 10:04 UTC (Fri) by moltonel (guest, #45207)

> Sure if the bug is deterministic then replaying the same transactions on the replica will cause the same corruption - but if there's a bug that's dependent on particular server state that corrupts an underlying data structure - it's very likely that the replicas won't have that same on-disk corruption when they play a statement-based replication stream - so you can fail over to a replica and keep going. With Postgres shipping the raw data structures - if they corrupt on the master, that corruption goes straight to all the replicas without an additional sanity check.

Have a re-read of the article: the bug that affected Uber was not trickling from the master to all the replicas. Each replica had corruption on different rows. The mailing list thread also mentions that misconception.

While each replication strategy brings its own class of potential bugs (with statement-based replication generally seen as the most fragile kind), this particular bug was apparently not made more likely by Uber/PG's choice of replication architecture, and MySQL isn't shielded from that kind of bug either.


Copyright © 2016, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds