
Query planners are still dumb

Posted Aug 13, 2024 15:30 UTC (Tue) by cen (subscriber, #170575)
Parent article: Changes coming in PostgreSQL 17

I've always wondered why planners don't automatically perform some well-known query optimizations, like transforming pagination queries into deferred joins.



Query planners are still dumb

Posted Aug 14, 2024 7:24 UTC (Wed) by Sesse (subscriber, #53779) (5 responses)

Having worked on one SQL planner and being fairly familiar with another, I have one response to this “well known” optimization: What?

Query planners are still dumb

Posted Aug 14, 2024 10:33 UTC (Wed) by cen (subscriber, #170575) (4 responses)

Search for "sql deferred join"; it's a technique that can drastically improve the performance of pagination queries.
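
For the unfamiliar, the rewrite looks roughly like the sketch below. The `posts`, `id` and `created_at` names are hypothetical, and it assumes an index on `created_at`:

    -- Plain LIMIT/OFFSET: the executor fetches and throws away 10000
    -- wide rows before returning the 20 requested ones.
    SELECT * FROM posts ORDER BY created_at DESC LIMIT 20 OFFSET 10000;

    -- Deferred join: paginate over the narrow key column first (ideally
    -- satisfied from the index), then join back to fetch the wide rows
    -- for only the 20 surviving ids.
    SELECT p.*
    FROM posts p
    JOIN (SELECT id FROM posts
          ORDER BY created_at DESC
          LIMIT 20 OFFSET 10000) AS page ON page.id = p.id
    ORDER BY p.created_at DESC;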

Query planners are still dumb

Posted Aug 14, 2024 10:39 UTC (Wed) by Sesse (subscriber, #53779)

From a quick search, this seems to be a rewrite of LIMIT/OFFSET queries (which you should very rarely use anyway if you care about performance; use keyset pagination!). I found a few variations, all of which were one or more of:

- Giving wrong results
- Completely bogus optimization-wise
- Solved better on a different layer of the database
- Only relevant to old versions of certain databases

So I don't think you'll see this happen in most databases anytime soon. Remember that a query optimizer can only do things that are guaranteed to give the correct result; it needs to be helpful rather than hurtful in the vast majority of cases; and finally, the case needs to be common enough (ideally, in non-broken code) to warrant adding extra logic for it.

Query planners are still dumb

Posted Sep 27, 2024 7:33 UTC (Fri) by ringerc (subscriber, #3071)

If you want this, use a refcursor, and WITH HOLD if needed.

That way your pagination will actually be correct, at the cost of extra DB resources held busy during user think-time.
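
A minimal sketch of that pattern, reusing the hypothetical `sometable`/`sortkey` names that come up again further down:

    BEGIN;
    DECLARE page_cur CURSOR WITH HOLD FOR
        SELECT * FROM sometable ORDER BY sortkey;
    COMMIT;  -- WITH HOLD materializes the result so the cursor
             -- survives the end of the transaction

    FETCH 50 FROM page_cur;  -- page 1
    FETCH 50 FROM page_cur;  -- page 2 resumes exactly where page 1 ended

    CLOSE page_cur;  -- releases the held result set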

I'm not quite sure what you're proposing as a way for the DB to help with pagination without keeping a query context around. PostgreSQL already does its best to reorder joins and push down filters to select the desired subset of data from a query, but this is generally not feasible if you're using LIMIT and OFFSET for pagination rather than key ranges.

Query planners are still dumb

Posted Sep 27, 2024 7:41 UTC (Fri) by ringerc (subscriber, #3071) (1 response)

As far as I can tell, what you are asking for is for the DB to delay fetching non-key row data until it knows the data won't be discarded by a LIMIT and OFFSET clause.

This doesn't really make sense with PostgreSQL's heap table structure. You have to read the row to get the key and ensure the row pointed to by the b-tree index is visible, even in an index scan.

The only case where this strategy would be likely to be useful at all is where an index-only scan can be performed due to an up-to-date visibility map, and where the query is a trivial `ORDER BY key_col LIMIT ... OFFSET ...` such that the total ordering is completely dependent on that one index.

In that case I'm not sure if PostgreSQL will skip fetching the heap rows skipped by an OFFSET clause, and I agree it probably could if it doesn't. It generally will skip the much more expensive operation of fetching any external TOASTed data though.
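
Concretely, that narrow case looks something like the sketch below (hypothetical names; the index and a recently vacuumed table are the assumptions that make an index-only scan possible):

    CREATE INDEX sometable_key_idx ON sometable (key_col);
    VACUUM sometable;  -- keeps the visibility map up to date

    -- Only key_col is selected and the ordering depends on key_col
    -- alone, so this can be satisfied by an index-only scan:
    SELECT key_col FROM sometable
    ORDER BY key_col
    LIMIT 50 OFFSET 10000;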

LIMIT / OFFSET pagination is kind of terrible anyway. In most cases where pagination is unavoidable, it's better to use a predicate filter on the ordered attribute, e.g. `SELECT ... FROM sometable WHERE sortkey > $1 ORDER BY sortkey LIMIT 50` - where "$1" is the value of the last entry in the previously-seen page. Not only does this generally execute much more efficiently, but its results won't skip rows if the row-set changes between page requests.
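
Spelled out a little further, with a unique tiebreaker column so duplicate sort keys can't cause rows to be skipped (a sketch; it assumes `id` is unique and that an index on `(sortkey, id)` exists):

    -- First page:
    SELECT * FROM sometable
    ORDER BY sortkey, id
    LIMIT 50;

    -- Subsequent pages: $1 and $2 are the sortkey and id of the last
    -- row of the previous page; the row-value comparison can still use
    -- the (sortkey, id) index.
    SELECT * FROM sometable
    WHERE (sortkey, id) > ($1, $2)
    ORDER BY sortkey, id
    LIMIT 50;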

Query planners are still dumb

Posted Oct 1, 2024 13:08 UTC (Tue) by raven667 (subscriber, #5198)

> `SELECT ... FROM sometable WHERE sortkey > $1 ORDER BY sortkey LIMIT 50` - where "$1" is the value of the last entry in the previously-seen page

I hadn't thought of that optimization before; I usually skip pagination and just dump the whole result set on the client and let it choke on it, but I'll keep that technique in my back pocket because it looks useful. I do use the Netbox REST API a lot, which is a Django app, and have to deal with the performance impact of paginated queries; I wonder whether integrating this into the Django ORM for those queries would make any difference, or whether all the overhead is in JSON serialization and round trips between pages. GraphQL seems faster, but I don't understand it at all, so it's not that useful in practice when I don't know how to do what I want.

