|
|
Log in / Subscribe / Register

PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming

May 21, 2014

This article was contributed by Josh Berkus

It's May, which means that it's time for a new PostgreSQL beta release. As with each annual release, PostgreSQL 9.4 has a few dozen new features addressing the various ways people use the database system. Among the features ready for testing are: index and aggregation performance improvements, materialized views, ALTER SYSTEM SET, dynamically loadable background workers, new array-manipulation syntax, security barrier views, and time-delayed standbys. While users all have their own favorites among the new features, this article will focus on two features that have received the most attention: the new JSONB type, and Data Change Streaming.

JSONB

JSON, which stands for "JavaScript Object Notation", is a serialization format originally designed to allow JavaScript programs to store stuff in disk or cache. It is a dialect of YAML, and consists of keys, values, and lists. Over the last few years JSON has become a kind of lingua franca for data interchange between programs written in different languages, largely replacing XML for new applications.

PostgreSQL 9.4 introduces a new data type called "JSONB" for "binary JSON". The name is somewhat deceptive; since JSON is a text serialization format, it's not binary anything. What JSONB does is implement storage for semantic JSON in a specialized compressible tree structure based on the PostgreSQL extension HStore. This binary structure allows the implementation of new features that make PostgreSQL JSON much more useful: indexing, transformation, sorting, path search, and matching keys and values. It also means better performance on some operations.

Version 9.2 introduced JSON text support to PostgreSQL. However, that data was stored as text, which had several disadvantages. All operations had to re-parse the JSON, even purely internal ones. It was also impossible to sort JSON values, or to determine the equivalence of two JSON strings since order and white space were preserved. For backward compatibility reasons, however, the original text JSON type will be retained in PostgreSQL, which requires users to choose between the two types. The old JSON type will continue to be useful for users who need to preserve white space and key ordering.

Indexing JSONB

The biggest feature advantage of JSONB is the ability to create general indexes on columns of JSONB values, which can then be used for almost any search on that data. These indexes use GIN (Generic INverted) indexes, which have also received a substantial (50% or more) performance boost for 9.4. As an example, imagine that we have this table of publication data for a set of books:

    Table "public.booksdata"
      Column  |  Type
    ----------+--------
     title    | citext
     isbn     | isbn
     pubinfo  | jsonb

We can create a GIN index on the bookdata column:

    CREATE INDEX booksdex ON booksdata USING gin_hash_ops (pubinfo);

And then do path queries and matching against that data, which will use the index:

    SELECT title, isbn, pubinfo #> '{ "whlscost" }' as cost
    FROM booksdata
    WHERE pubinfo @> '{ "publisher" : "Avon" }';

As with other exotic data types, the new features are supported by a mix of functions and operators which are available inside the SQL interface. For example, "@>" used above means "contains", and "#>" means "extract path". In this case, a "path" is a chain of hierarchical keys, such as "publisher, format, edition". Path queries will become even more useful in PostgreSQL 9.5, when wildcards will be supported.

The new type also supports doing element extraction, comparisons, and sorting, which allows treating JSON values like other kinds of data values. For example, given a set of heterogeneous JSON strings, we can get them to sort in a way which makes intuitive sense:

    SELECT some_json FROM sortjson
    ORDER BY some_json;
	   some_json
    ------------------------
    {"a": 1, "b": "carol"}
    {"a": 1, "b": "mel"}
    {"a": 3, "b": "alice"}
    {"b": "alice", "c": 2}

Version 9.4 also introduces several aggregation functions, which let you "roll up" JSON from an entire column:

    SELECT title,
      json_agg(pubinfo #>> '{ "published_on" }') as pub_dates
    FROM booksdata
    WHERE title = 'Sphere' group by title;

     title  |           pub_dates
    --------+------------------------------
     Sphere | ["1999-03-04", "1998-07-04"]

This takes the publication date from a set of JSON documents, and aggregates them into a JSON array.

Utility and purpose of JSONB

There have been many questions inside and outside the PostgreSQL community (for example, on Reddit) as to why an advanced relational database is implementing functionality more associated with non-relational, or "NoSQL", databases. Some commenters, in particular, see JSONB as an abandonment of the relational model. So, why did the project implement it?

The first answer is that, historically, the PostgreSQL project has continually added new data types to support new kinds of data which need storage, indexing, and manipulation. Over the last 16 years, this has included IP address types, arrays and matrices, spatial data types, XML data, and others. So adding support for storing data in JSON format is just a continuation of that.

Perhaps more importantly, JSON supports having the application add new attributes to the data at runtime. One of the limitations of the SQL model has long been the lack of a safe way for a user application to add columns in response to a user action, such as data from a configuration control panel. PostgreSQL 8.3 previously bridged this gap with the HStore indexed key-value data type. JSONB is a furtherance of that concept that supports not only key-value data but also hierarchical data.

Existing PostgreSQL users also wanted richer JSON features to support their use of JSON as an output format and an API for web applications that interact with the database. This allows construction of simple two- or three-tier web and mobile applications that get back data as JSON. This data can then be passed through directly to a JavaScript client without further manipulation. In particular, the new operators and functions that come with JSONB will enable building more of these kinds of applications, faster.

Comparisons to NoSQL

Of course, the other big reason to implement indexed JSON storage inside PostgreSQL is competition with new, non-relational databases — the so-called NoSQL databases. Many of these databases use JSON as a storage or API format, which is appealing to web developers who are already familiar with it.

With JSONB, PostgreSQL now implements a large chunk of the JSON data manipulation features available in databases like MongoDB and CouchDB, with comparable performance. This is all implemented without sacrificing the database's relational capabilities, reliability, or multi-core performance. The project is betting that it can lure back users who like JSON but are unhappy with some of the limitations of the new databases.

The PostgreSQL project will need to implement more to become really attractive for use as a NoSQL database, though. A fast API that doesn't require SQL would be the first step, and projects like pgRest, Mongolike, and Mongres have all been working on this. A second requirement is much more difficult: sharding and horizontal scalability; this may possibly be addressed by the recently-released PostgresXL.

Other features remain on the "to do" list for PostgreSQL's JSON support that the developers plan to address, including: offering ways to update a single key in a large JSON document, wildcard support for path queries, and faster and more versatile indexing. That last item will be addressed as part of the VODKA project (yes, really), which will be launched at pgCon this week. One only wonders how long it will take PostgreSQL to support WHISKEY and RUM indexing as well.

Data Change Streaming

PostgreSQL's built-in replication has proved to be extremely popular and robust, allowing many users to improve redundancy and scale. However, it is limited to replicating the entire database in only one direction, which constrains the kind of scale-out architectures users can build with it. It also requires both master and replica to be running the same PostgreSQL version, preventing upgrade-by-replication.

There have been other systems to work around this, such as Slony-I, Bucardo, and Londiste. However, all of these limit both throughput and the changes application developers are allowed to make to the database. As a result, they have been unattractive for large scale-out infrastructures.

Data Change Streaming, a 9.4 feature added by 2nd Quadrant developer Andres Freund, provides a C API for listening to PostgreSQL's binary replication stream and extracting row changes and SQL statements from it. This feature has also been called "Changeset Extraction" and "Logical Decoding" at various points. This release also includes pg_recvlogical, a command-line utility that connects to PostgreSQL replication and writes out data changes to STDOUT or a file.

Data Change Streaming can become a game-changing feature for PostgreSQL, allowing development of sophisticated multi-directional replication (otherwise known as "multi-master"), and automated sharding systems, without sacrificing per-node performance. That will require the construction of an entire layer of tools on top of Data Change Streaming before most users are able to utilize it. The Slony-I project is already writing code to use the new API.

The biggest "to do" for the next version is to find a way to capture Data Definition Language (DDL) statements, which are the commands the user sends to create and modify table designs that are not represented in the data change stream in 9.4. Fixing this is a high priority for 9.5, so that the replication system doesn't interfere with continuous integration and development pushes, which may include such statements.

Other Features

As with other PostgreSQL releases, this one contains a large number of other features. Among them are:

ALTER SYSTEM SET: this new statement allows setting PostgreSQL configuration variables in the configuration file over a database connection. This enables easier auto-tuning and management of large numbers of PostgreSQL instances.

REFRESH MATERIALIZED VIEW CONCURRENTLY: this allows users to append or rebuild "materialized views" in the background while other users are still querying the old version. As materialized views are large, complex reporting queries whose results have been stored for quick reference, this will make PostgreSQL more useful as an analytics and decision support database.

Dynamic background workers: version 9.3 introduced the idea of the "background worker", a daemon that would start and stop with PostgreSQL and handle background tasks. Now, 9.4 adds the ability to make these workers dynamically loadable, which means that they can be launched in response to server tasks, permitting asynchronous activity, parallelism, and deferred maintenance.

This is just the first beta release, but the list of features is expected to be stable between this and the final release. Multiple betas will be released over the next four months, culminating in a final release sometime in September. Development on PostgreSQL version 9.5 will start in June.

[ Josh Berkus is a member of the PostgreSQL core team. ]

Index entries for this article
GuestArticlesBerkus, Josh


to post comments

PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming

Posted May 21, 2014 18:24 UTC (Wed) by mbanck (guest, #9035) [Link] (2 responses)

It's Andres Freund, not Andreas.

Probably also in order to mention that JSONB/GIN got at least partly implemented by the original authors of HStore, Oleg Bartunov, Teodor Sigaev and (new to the show) Alexander Korotkov, along with Andrew Dunstan and Peter Geoghegan.

Fixed

Posted May 21, 2014 18:47 UTC (Wed) by corbet (editor, #1) [Link]

Sigh, we should have caught that one. Fixed now, thanks.

PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming

Posted May 21, 2014 19:19 UTC (Wed) by jberkus (guest, #55561) [Link]

Gah, darned automated spellcheck. I really need to add Andres' name to my dictionary.

In the end, I didn't want to put the whole list of contributor names into the JSONB portion of the article, so thank you for mentioning them here in the comments!

PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming

Posted May 21, 2014 21:23 UTC (Wed) by amk (subscriber, #19) [Link] (8 responses)

Correction to paragraph 2: I don't think JSON is a dialect of YAML in any sense. YAML uses multiple lines for lists and mappings, while JSON is a subset of JavaScript syntax. They look very different.

PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming

Posted May 21, 2014 21:38 UTC (Wed) by MegabytePhreak (guest, #60945) [Link] (7 responses)

In typical use, they do tend to look different. However, any valid JSON is also valid YAML with the same meaning. YAML just has so much flexibility that it can also be written in a manner (almost?)totally devoid of curly braces.

PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming

Posted May 21, 2014 21:58 UTC (Wed) by SEJeff (guest, #51588) [Link]

Relevant docs: http://yaml.org/spec/1.2/spec.html#id2759572

This was news to me, kind of crazy.

YAML vs JSON (was PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming)

Posted May 22, 2014 0:10 UTC (Thu) by dskoll (subscriber, #1630) [Link] (4 responses)

YAML started out as an interesting serialization format, but it has since accreted features and become so complex that IMO it's a nightmare.

Compare the JSON spec to the YAML spec. Night and day.

YAML can, of course, represent things JSON can't such as self-referential data structures, but these are seldom needed for data interchange.

YAML vs JSON (was PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming)

Posted May 22, 2014 12:03 UTC (Thu) by zenaan (guest, #3778) [Link] (2 responses)

I have worked with YAML and find it awesome as a human-maintainable data/ config/ arbitrary-information storage format.

In particular, for anything that starts to get at all complex like user interface descriptions, complex config files and database descriptions (this is the limit of what I've worked with personally, but I've no doubt of its flexibility), when interacting with humans, and subscribing to 'everything must be a text file', YAML is as good as it gets:
easy to read
easy to edit
easy to represent anything

Also, for me, it's just fine for a simple starting point which is not complex at all, and over time as needs expand, YAML is always there for me and never says "that's a bit complex, you need a workaround". It can be complex, but only when I actually need it.

OK, enough proselytizing.

YAML vs JSON (was PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming)

Posted May 22, 2014 13:35 UTC (Thu) by dskoll (subscriber, #1630) [Link] (1 responses)

It can be complex, but only when I actually need it.

That's true of YAML-the-data-interchange-format. The problem is that a YAML parsing library must be able to handle the full spec in all its glory. This introduces bloat as well as the possibility (likelihood, even) of bugs because of the complexity of the spec.

Also, JSON can be made fairly human-readable with judicious use of whitespace. It's just that most JSON generators don't bother with that and spit out very compressed JSON.

YAML vs JSON (was PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming)

Posted May 22, 2014 13:38 UTC (Thu) by liw (subscriber, #6379) [Link]

My problem with writing JSON manually is that it's fairly verbose,
and also it doesn't seem to handle long text blocks sensibly: they
all need to be on one line.

Thus, I prefer YAML when, for example, needing to embed long Markdown
strings, but JSON as a serialisation format between programs, when
humans don't need to touch it.

YAML vs JSON (was PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming)

Posted Dec 18, 2014 16:57 UTC (Thu) by epa (subscriber, #39769) [Link]

Yes, I think YAML's evolutionary niche is squeezed a bit by JSON on one side and XML on the other. JSON is simple; XML is complex but formally defined and standardized, and has a wide range of APIs and processing tools such as DOM and XSLT - which in turn may not always be pearls of minimalist beauty, but are well-defined standards with several high-quality implementations. And for fast serialization of program data, I'd probably pick something like Sereal.

JSON vs YAML

Posted May 22, 2014 7:29 UTC (Thu) by jnareb (subscriber, #46500) [Link]

I have heard that while JSON could be considered a subset of YAML, there is problem with Unicode contents.

redis?

Posted May 21, 2014 22:04 UTC (Wed) by ncm (guest, #165) [Link] (7 responses)

I've become a bit disconnected from this world... When people mention prominent nosql data stores, nobody seems to mention redis. Is redis considered a dead end, not a legitimate member of the club, or just too obscure?

redis?

Posted May 22, 2014 0:37 UTC (Thu) by zlynx (guest, #2285) [Link] (2 responses)

Well, I've never heard of it. Maybe the project needs to start advertising and create a cute mascot.

Create some controversy for tech news to report on by having some of its founders fight and fork the project and the mascots so one gets sunglasses and other one gets a mohawk.

redis?

Posted May 22, 2014 13:13 UTC (Thu) by SEJeff (guest, #51588) [Link] (1 responses)

Seriously? You don't get out much do you :)

Redis has effectively dethroned memcached as one of the de-facto caching tools used for any large lamp stack architecture. Redis doesn't need to do any marketing, the tech sells it (it is OSS).

The author also happens to solve some super difficult problems with clever ideas: http://antirez.com/news/75

Moral of the story, if you don't know about redis it isn't because they need to market more. It is because you need to explore some of the tech around you.

redis?

Posted May 22, 2014 14:54 UTC (Thu) by intgr (subscriber, #39733) [Link]

> The author also happens to solve some super difficult problems with clever ideas: http://antirez.com/news/75

Worth mentioning given the topic of the article, PostgreSQL had an extension for HyperLogLog a year before (http://research.neustar.biz/2013/02/04/open-source-releas...) and some other probabilistic estimators since at least 2011: https://github.com/tvondra/distinct_estimators

redis?

Posted May 22, 2014 1:25 UTC (Thu) by jberkus (guest, #55561) [Link]

Well, Redis doesn't use JSON. So it wasn't really relevant for this article. Aside from that, it's a great database which I use all the time.

redis?

Posted May 22, 2014 7:43 UTC (Thu) by oseemann (subscriber, #6687) [Link] (2 responses)

Redis is not built to be a persistent and durable data store, I consider it more like a fancy memcached. It holds all its data in RAM and is very vulnerable to power loss issues as it syncs data to disk rather infrequently (every few minutes in the default config).

So it's not a suitable replacement for an RDBMS for most people.

redis?

Posted May 22, 2014 8:40 UTC (Thu) by oseemann (subscriber, #6687) [Link]

Correction: I just checked the docs again and Redis also has a WAL equivalent called AOF that can be configured to sync after each query.

http://redis.io/topics/persistence

redis?

Posted May 22, 2014 9:54 UTC (Thu) by moltonel (subscriber, #45207) [Link]

Redis can do proper persistence nowadays, in a maner fairly similar to Postgres's WAL logging. You just need to configure it according to your needs. http://redis.io/topics/persistence

Redis is great. It's not a replacement for relational or nosql databases, it's complementary. More than a cache, less than a full-blown db. Many projects use redis alongside other databases.

PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming

Posted May 22, 2014 3:02 UTC (Thu) by andresfreund (subscriber, #69562) [Link] (4 responses)

> Data Change Streaming, a 9.4 feature added by 2nd Quadrant developer Andres Freund, provides a C API for listening to PostgreSQL's binary replication stream and extracting row changes and SQL statements from it. This feature has also been called "Changeset Extraction" and "Logical Decoding" at various points. This release also includes pg_recvlogical, a command-line utility that connects to PostgreSQL replication and writes out data changes to STDOUT or a file.

So, to be a bit more precise, the API for listening to changes isn't actually primarily exposed via C (although it's possible to consume changes that way). The most powerful interface is to listen to changes via a special kind of connection to the database. That allows to receive the changes in a streaming fashion and supports implementing synchronous replication. For other use cases there's a API callable directly from SQL that returns all changes since the last time that function has been called.

The C API that's at the moment hard to avoid are the so called 'output plugins'. These format the data from postgres' internal format into the format the consumer wants. The only such plugin shipped with postgresql isn't necessarily a generally useful format, but there's already projects on github supporting json and SQL statements as output.

PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming

Posted May 22, 2014 3:11 UTC (Thu) by flewellyn (subscriber, #5047) [Link] (3 responses)

So, what's the upshot of the data change streaming? Does it mean the ability to do "differential logical backups", where you can output the changes as SQL?

That would be incredibly useful.

PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming

Posted May 22, 2014 3:23 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

All the capabilities for that are now there, yes. Some plumbing is still missing to make it easy and to integrate e.g. with pg_dump.

PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming

Posted May 22, 2014 7:17 UTC (Thu) by kleptog (subscriber, #1183) [Link] (1 responses)

There are all sorts of potential uses.

- Audits, tracking changes to a table
- Replicating a single database/table or even just parts of a table.
- On the fly statistics aggregation
- Differential backups
- Anything you can think of in answer to "what could I do with a stream of all changes in a database"

It's new so the full scope of possibilities it not yet visible, but it has the potential to be the basis of some awesome new features.

PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming

Posted May 22, 2014 17:06 UTC (Thu) by flewellyn (subscriber, #5047) [Link]

All of those things sound awesome, and I can envision uses for them at my workplace immediately.


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds