PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming
It's May, which means that it's time for a new PostgreSQL beta release. As with each annual release, PostgreSQL 9.4 has a few dozen new features addressing the various ways people use the database system. Among the features ready for testing are: index and aggregation performance improvements, materialized views, ALTER SYSTEM SET, dynamically loadable background workers, new array-manipulation syntax, security barrier views, and time-delayed standbys. While users all have their own favorites among the new features, this article will focus on two features that have received the most attention: the new JSONB type, and Data Change Streaming.
JSONB
JSON, which stands for "JavaScript Object Notation", is a serialization format originally designed to allow JavaScript programs to store stuff in disk or cache. It is a dialect of YAML, and consists of keys, values, and lists. Over the last few years JSON has become a kind of lingua franca for data interchange between programs written in different languages, largely replacing XML for new applications.
PostgreSQL 9.4 introduces a new data type called "JSONB" for "binary JSON". The name is somewhat deceptive; since JSON is a text serialization format, it's not binary anything. What JSONB does is implement storage for semantic JSON in a specialized compressible tree structure based on the PostgreSQL extension HStore. This binary structure allows the implementation of new features that make PostgreSQL JSON much more useful: indexing, transformation, sorting, path search, and matching keys and values. It also means better performance on some operations.
Version 9.2 introduced JSON text support to PostgreSQL. However, that data was stored as text, which had several disadvantages. All operations had to re-parse the JSON, even purely internal ones. It was also impossible to sort JSON values, or to determine the equivalence of two JSON strings since order and white space were preserved. For backward compatibility reasons, however, the original text JSON type will be retained in PostgreSQL, which requires users to choose between the two types. The old JSON type will continue to be useful for users who need to preserve white space and key ordering.
Indexing JSONB
The biggest feature advantage of JSONB is the ability to create general indexes on columns of JSONB values, which can then be used for almost any search on that data. These indexes use GIN (Generic INverted) indexes, which have also received a substantial (50% or more) performance boost for 9.4. As an example, imagine that we have this table of publication data for a set of books:
Table "public.booksdata"
Column | Type
----------+--------
title | citext
isbn | isbn
pubinfo | jsonb
We can create a GIN index on the bookdata column:
CREATE INDEX booksdex ON booksdata USING gin_hash_ops (pubinfo);
And then do path queries and matching against that data, which will use the index:
SELECT title, isbn, pubinfo #> '{ "whlscost" }' as cost
FROM booksdata
WHERE pubinfo @> '{ "publisher" : "Avon" }';
As with other exotic data types, the new features are supported by a mix of functions and operators which are available inside the SQL interface. For example, "@>" used above means "contains", and "#>" means "extract path". In this case, a "path" is a chain of hierarchical keys, such as "publisher, format, edition". Path queries will become even more useful in PostgreSQL 9.5, when wildcards will be supported.
The new type also supports doing element extraction, comparisons, and sorting, which allows treating JSON values like other kinds of data values. For example, given a set of heterogeneous JSON strings, we can get them to sort in a way which makes intuitive sense:
SELECT some_json FROM sortjson
ORDER BY some_json;
some_json
------------------------
{"a": 1, "b": "carol"}
{"a": 1, "b": "mel"}
{"a": 3, "b": "alice"}
{"b": "alice", "c": 2}
Version 9.4 also introduces several aggregation functions, which let you "roll up" JSON from an entire column:
SELECT title,
json_agg(pubinfo #>> '{ "published_on" }') as pub_dates
FROM booksdata
WHERE title = 'Sphere' group by title;
title | pub_dates
--------+------------------------------
Sphere | ["1999-03-04", "1998-07-04"]
This takes the publication date from a set of JSON documents, and aggregates them into a JSON array.
Utility and purpose of JSONB
There have been many questions inside and outside the PostgreSQL community (for example, on Reddit) as to why an advanced relational database is implementing functionality more associated with non-relational, or "NoSQL", databases. Some commenters, in particular, see JSONB as an abandonment of the relational model. So, why did the project implement it?
The first answer is that, historically, the PostgreSQL project has continually added new data types to support new kinds of data which need storage, indexing, and manipulation. Over the last 16 years, this has included IP address types, arrays and matrices, spatial data types, XML data, and others. So adding support for storing data in JSON format is just a continuation of that.
Perhaps more importantly, JSON supports having the application add new attributes to the data at runtime. One of the limitations of the SQL model has long been the lack of a safe way for a user application to add columns in response to a user action, such as data from a configuration control panel. PostgreSQL 8.3 previously bridged this gap with the HStore indexed key-value data type. JSONB is a furtherance of that concept that supports not only key-value data but also hierarchical data.
Existing PostgreSQL users also wanted richer JSON features to support their use of JSON as an output format and an API for web applications that interact with the database. This allows construction of simple two- or three-tier web and mobile applications that get back data as JSON. This data can then be passed through directly to a JavaScript client without further manipulation. In particular, the new operators and functions that come with JSONB will enable building more of these kinds of applications, faster.
Comparisons to NoSQL
Of course, the other big reason to implement indexed JSON storage inside PostgreSQL is competition with new, non-relational databases — the so-called NoSQL databases. Many of these databases use JSON as a storage or API format, which is appealing to web developers who are already familiar with it.
With JSONB, PostgreSQL now implements a large chunk of the JSON data manipulation features available in databases like MongoDB and CouchDB, with comparable performance. This is all implemented without sacrificing the database's relational capabilities, reliability, or multi-core performance. The project is betting that it can lure back users who like JSON but are unhappy with some of the limitations of the new databases.
The PostgreSQL project will need to implement more to become really attractive for use as a NoSQL database, though. A fast API that doesn't require SQL would be the first step, and projects like pgRest, Mongolike, and Mongres have all been working on this. A second requirement is much more difficult: sharding and horizontal scalability; this may possibly be addressed by the recently-released PostgresXL.
Other features remain on the "to do" list for PostgreSQL's JSON support that the developers plan to address, including: offering ways to update a single key in a large JSON document, wildcard support for path queries, and faster and more versatile indexing. That last item will be addressed as part of the VODKA project (yes, really), which will be launched at pgCon this week. One only wonders how long it will take PostgreSQL to support WHISKEY and RUM indexing as well.
Data Change Streaming
PostgreSQL's built-in replication has proved to be extremely popular and robust, allowing many users to improve redundancy and scale. However, it is limited to replicating the entire database in only one direction, which constrains the kind of scale-out architectures users can build with it. It also requires both master and replica to be running the same PostgreSQL version, preventing upgrade-by-replication.
There have been other systems to work around this, such as Slony-I, Bucardo, and Londiste. However, all of these limit both throughput and the changes application developers are allowed to make to the database. As a result, they have been unattractive for large scale-out infrastructures.
Data Change Streaming, a 9.4 feature added by 2nd Quadrant developer Andres Freund, provides a C API for listening to PostgreSQL's binary replication stream and extracting row changes and SQL statements from it. This feature has also been called "Changeset Extraction" and "Logical Decoding" at various points. This release also includes pg_recvlogical, a command-line utility that connects to PostgreSQL replication and writes out data changes to STDOUT or a file.
Data Change Streaming can become a game-changing feature for PostgreSQL, allowing development of sophisticated multi-directional replication (otherwise known as "multi-master"), and automated sharding systems, without sacrificing per-node performance. That will require the construction of an entire layer of tools on top of Data Change Streaming before most users are able to utilize it. The Slony-I project is already writing code to use the new API.
The biggest "to do" for the next version is to find a way to capture Data Definition Language (DDL) statements, which are the commands the user sends to create and modify table designs that are not represented in the data change stream in 9.4. Fixing this is a high priority for 9.5, so that the replication system doesn't interfere with continuous integration and development pushes, which may include such statements.
Other Features
As with other PostgreSQL releases, this one contains a large number of other features. Among them are:
ALTER SYSTEM SET: this new statement allows setting PostgreSQL configuration variables in the configuration file over a database connection. This enables easier auto-tuning and management of large numbers of PostgreSQL instances.
REFRESH MATERIALIZED VIEW CONCURRENTLY: this allows users to append or rebuild "materialized views" in the background while other users are still querying the old version. As materialized views are large, complex reporting queries whose results have been stored for quick reference, this will make PostgreSQL more useful as an analytics and decision support database.
Dynamic background workers: version 9.3 introduced the idea of the "background worker", a daemon that would start and stop with PostgreSQL and handle background tasks. Now, 9.4 adds the ability to make these workers dynamically loadable, which means that they can be launched in response to server tasks, permitting asynchronous activity, parallelism, and deferred maintenance.
This is just the first beta release, but the list of features is expected to be stable between this and the final release. Multiple betas will be released over the next four months, culminating in a final release sometime in September. Development on PostgreSQL version 9.5 will start in June.
[ Josh Berkus is a member of the PostgreSQL core team. ]
| Index entries for this article | |
|---|---|
| GuestArticles | Berkus, Josh |
