June 2, 2011
This article was contributed by Josh Berkus
PGCon is the international PostgreSQL contributor conference, held in Ottawa, Canada for the last four years. It serves the same purpose for the PostgreSQL community that the Linux Plumbers Conference does for Linux, attracting most of the top code contributors to PostgreSQL and its extensions and tools. Nearly 200 contributors from around the world attended this year, including hackers from as far away as Chile and Ukraine, as well as a large contingent from Japan.
PostGIS knows where you are
"Everyone is being followed, all the time,"
said
Paul Ramsey, beginning his talk on PostGIS. Ramsey is the creator of PostGIS, the geographic data
extension to PostgreSQL, and works for OpenGeo as an architect. He talked
about how changes to the way we live are causing an explosion in the
adoption of geographic and spatial technologies. "PostGIS knows where you are. Should you be worried? No. Should
you be worried about Apple? Probably," he said.
He also went briefly into the
history of the Geographic Information System (GIS), which was invented in Canada, and then talked about the PostGIS project development.
PostGIS has been advancing rapidly, and according to Ramsey is now advancing GIS itself. "We ran out of things to implement in the standard, so we moved to a new standard that had more things to implement." Plans for PostGIS 2.0 include raster support, 3D and 4D indexing, and volumetric objects.
The first requests for raster support and volumetric objects have come from archaeologists and water system planners. These people need to represent three-dimensional objects underground; not only their shape but their relative position in three dimensions. Another area where volumetric shapes are essential is airspace; one attendee demonstrated an SVG and PostGIS application which showed Ottawa airport traffic in real time.
"4D indexing" means indexing objects not just in space but in time as well so that people and objects can be tracked as they move. The proliferation of public video cameras and geographic data from our cell phones is supplying a flood of data about motion as well as location. Currently no software exists which can represent this data and search it on a large scale. The PostGIS project plans to develop it.
The primary limitations PostGIS is struggling with today are write-scalability and lack of parallel processing for GIS data. The latter is an issue because rendering, transforming, and testing spatial objects — such as for answering the question "which building parcels border this one?" — are CPU-intensive. Currently you cannot use more than one core for a single GIS query.
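A parcel-adjacency query of the kind described can be written with PostGIS's topological predicates. A sketch, with an illustrative schema (the table and column names are hypothetical):

```sql
-- Hypothetical schema: parcels(id, geom), with a GiST index on geom.
-- The exact adjacency test is CPU-intensive per candidate pair;
-- the index only prunes candidates by bounding box.
SELECT b.id
FROM parcels a
JOIN parcels b
  ON a.geom && b.geom             -- cheap bounding-box prefilter (uses GiST index)
 AND ST_Touches(a.geom, b.geom)   -- exact, expensive adjacency test
WHERE a.id = 42
  AND b.id <> a.id;
```

Because the expensive `ST_Touches()` evaluations cannot currently be spread across cores, a query like this runs the exact tests serially even on a many-core machine.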
Recently Foursquare had to turn away from PostGIS because it couldn't keep up with the hundreds of thousands of updates per second that the service gets. Both the transactional overhead and the time required to write serialized geometric objects were too much. They switched to custom software on top of Lucene and MongoDB. Ramsey hopes that Postgres-XC will supply clustering that PostGIS can use.
Conference Sessions
Since PGCon consists of 34 sessions in three parallel tracks, I couldn't attend everything, and there were many good sessions I had no time for. However, I did make it to:
Hacking the query planner
"My name is Tom Lane, I've been hacking the query planner, and I'm looking for some help."
Possibly the most popular session at PGCon was Tom Lane's review and explanation of how the PostgreSQL query planner works. Tom Lane is the lead hacker on the PostgreSQL project, and his session, which ran overtime into a second session, was packed.
The query planner is part of answering database requests; it decides how to retrieve data most efficiently. Since even a moderately complex query can involve making decisions between millions of possible execution paths, the query planner code itself is a bit ... opaque.
Some of PostgreSQL's more than 25,000 lines of planner code is 20 years old. Many parts of the planner are best described as "kludgy," but Tom Lane and other lead developers are reluctant to modify them because they currently work fairly well. More importantly, the query planner is a holistic system whose parts interact in ways that are hard to predict. The difficulty of the task has left few developers interested in making improvements; Lane gave his talk in hopes of changing that.
The planner is the most complex of the four stages that go into answering a PostgreSQL query. First, the Parser decides the semantic meaning of the query string, separating out object identifiers, operators, values, and syntax. Second, the Rewriter expands views, rules, and aliases. Third, the Planner chooses indexes, join plans, "flattens" subselects, and applies expressions to different tables in order to find the best query plan to execute. Finally, the Executor runs the plan the Planner produces.
The three goals of the planner are: to find a good (fast) query plan, to
not spend too much time finding it, and to support PostgreSQL's
extensibility by treating most operators, types, and functions as black box objects. The planner does this by forming a tree of "plan nodes," each of which represents a specific type of processing to be performed. Each node takes a stream of tuples (i.e. rows) in and outputs a stream of tuples.
Plan nodes come in three types: Relation Scans, which read data from tables and indexes, Join nodes, which join two other nodes, and Special Plan nodes, which handle things like aggregates, function scans, and UNIONs of two relations. Each node has a data source (the input stream), output columns and calculations, and qualifiers or filters. Nodes also have a set of statistics in order to estimate how much time it will take to do all of these things. Plan nodes are mostly autonomous from each other, but not always; for example, the planner takes output sort order into account when considering a merge join.
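The node tree is visible in the output of PostgreSQL's EXPLAIN command. A sketch with hypothetical tables (the exact node choices depend on table sizes and statistics, so the plan shown is just one typical shape):

```sql
EXPLAIN
SELECT c.name, count(*)
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;

-- Typical output shape (abridged):
--  HashAggregate                      <- special plan node (aggregate)
--    ->  Hash Join                    <- join node: consumes two tuple streams
--          ->  Seq Scan on orders     <- relation scan node
--          ->  Hash
--                ->  Seq Scan on customers
```

Each indented node feeds its stream of tuples to the node above it, matching the tree structure Lane described.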
The first phase the planner goes through is an attempt to simplify the query. This includes simplifying expressions, "inlining" simple SQL functions, and "flattening" subselects and similar query structures into joins. This simplification both gives the planner the maximum flexibility in how to plan the query, and reduces repetitive re-processing for each tuple.
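The flattening step can also be observed with EXPLAIN: an IN subselect is normally planned as a join rather than executed once per row. A sketch, again with hypothetical tables:

```sql
-- The planner normally flattens this subselect into a semi-join...
SELECT *
FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE region = 'EU');

-- ...so the plan resembles that of an explicit join:
--  Hash Semi Join
--    ->  Seq Scan on orders
--    ->  Hash
--          ->  Seq Scan on customers
--                Filter: (region = 'EU')
```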
The planner then goes through Relation Scan and Join planning, which
roughly correspond to the FROM and WHERE portions of a SQL query. Of this
planning stage, the most expensive part is Join planning, since there is a
factorial expansion of the number of possible plans based on the number of
relations you are joining (i.e. n tables may be joined in n! ways). For a simple to moderately complex query, the planner will attempt to plan all possible execution plans, or Paths, it could take to perform the query.
For a very complex query — for example, one with a 25-table Join — the planner needs to resort to approximate planning methods, such as the "GEQO" genetic algorithm for creating query plans built into PostgreSQL. GEQO does not tend to produce very good query plans, however, and really needs improvement or replacement with another algorithm. There is also a set of simplification rules for complex Join trees, but since these are not estimate-based, they sometimes cause bad plans.
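The crossover point between exhaustive and genetic planning is configurable. The geqo_threshold setting controls how many FROM items trigger GEQO, and join_collapse_limit caps how many relations are considered together for reordering; GEQO can also be disabled outright, at the cost of potentially very long planning times:

```sql
SHOW geqo_threshold;        -- exhaustive search is used below this many FROM items
SHOW join_collapse_limit;   -- max relations merged into one join search space

-- Force exhaustive planning for the current session
-- (planning a 25-table join this way may take a very long time):
SET geqo = off;
```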
All of the preceding is needed to determine the possible Paths for
executing the query, so that the planner can calculate the cost of each
path and find the "cheapest" plan in terms of estimated execution time.
This estimate is created by combining a set of statistics about tables and
indexes, which is stored in the pg_statistic system table, with a set of estimation functions keyed to operators and data types. For each join or scan with a filter condition (such as a WHERE clause), the planner needs to estimate the selectivity of the conditions so that it can estimate the number of tuples returned and the cost of processing them.
Using both a histogram and stored selectivity estimates, the planner can usually calculate this fairly accurately, but often fails with complex expressions on interdependent relations or columns, such as AND/OR conditions on columns which are substantially correlated. The other large omission from this design which causes planning issues is the lack of estimate calculations for functions.
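The per-column statistics the planner consults are exposed through the pg_stats view, and the sample size can be raised per column when estimates are off. A sketch, with illustrative table and column names:

```sql
-- Inspect the most-common values and histogram for a column:
SELECT most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'customer_id';

-- Collect a finer-grained sample for a column with skewed data,
-- then refresh the stored statistics:
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;
```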
Lane also went over the need for major future work on the planner. Two
specific issues he mentioned were the need for parameterized scans for
subselects and functions, and the need to support estimates on foreign
tables in PostgreSQL 9.1. He also went over in detail some of the files
and code structures which make up the query planner. His slides
[PDF] are available on the PGCon website.
Developer Summits
The biggest change to the conference this year was the multiple summits and meetings held alongside and after the conference. Tuesday was the Clustering Hacker's Summit, Wednesday was the Developers' Meeting, and Saturday was the PL/Summit, a new event. The goal of these all-day meetings was better coordination of the hundreds of contributors in more than 20 countries who work on PostgreSQL.
The Clustering Summit was the second such event, sponsored by NTT Open Source, which also sponsored the previous summit in Tokyo in 2009. PostgreSQL supports multiple external replication and clustering projects, many of them having only one or two developers. The thirty attendees included developers on the Slony-I, Bucardo, Londiste, pgPool-II, and GridSQL projects. Several NTT and EnterpriseDB staff there were representing a newer project, called Postgres-XC, intended for write-scalable clustering.
For the last three years, twenty-five to thirty of the most prominent and prolific PostgreSQL code contributors have met in order to discuss development plans and projects. This is what has allowed the project to work on several features, such as binary replication, which have taken several years and several releases to complete. While there was a great deal of discussion at the summit, few decisions were made. The Developers' Meeting did set a schedule for 9.2 development, which will be very similar to last year's schedule, with CommitFests on June 15th, September 15th, November 15th, and January 15th, and a release sometime in the summer of 2012.
This year's PGCon also included the first summit of contributors to the various Procedural Language (PL) projects, drawing developers of PL/Perl, PL/Python, PL/Lua, PL/PHP, PL/Tcl, and pgOpenCL.
Conclusion
There were some other announcements at PGCon. Marc Fournier, one of the founding Core Team members for PostgreSQL, retired from the Core Team. This follows several changes on the Core Team in the last year, including the retirement of founding member Jan Wieck and the addition of major contributor Magnus Hagander. Postgres Open, a new business-oriented PostgreSQL conference, was announced by its organizer.
Lest you think that PGCon was entirely brainiac database hacking, the most popular track was probably the unofficial Pub Track held at the Royal Oak Pub near campus, where Aaron Thul introduced The PostgreSQL Drinking Game [PDF] at the conference. Next year you could go to PGCon, and learn both how to hack databases and how to drink to them.