User: Password:
|
|
Subscribe / Log in / New account

LWN.net Weekly Edition for May 22, 2014

PostgreSQL 9.4 beta: Binary JSON and Data Change Streaming

May 21, 2014

This article was contributed by Josh Berkus

It's May, which means that it's time for a new PostgreSQL beta release. As with each annual release, PostgreSQL 9.4 has a few dozen new features addressing the various ways people use the database system. Among the features ready for testing are: index and aggregation performance improvements, materialized views, ALTER SYSTEM SET, dynamically loadable background workers, new array-manipulation syntax, security barrier views, and time-delayed standbys. While users all have their own favorites among the new features, this article will focus on two features that have received the most attention: the new JSONB type, and Data Change Streaming.

JSONB

JSON, which stands for "JavaScript Object Notation", is a serialization format originally designed to allow JavaScript programs to store stuff in disk or cache. It is a dialect of YAML, and consists of keys, values, and lists. Over the last few years JSON has become a kind of lingua franca for data interchange between programs written in different languages, largely replacing XML for new applications.

PostgreSQL 9.4 introduces a new data type called "JSONB" for "binary JSON". The name is somewhat deceptive; since JSON is a text serialization format, it's not binary anything. What JSONB does is implement storage for semantic JSON in a specialized compressible tree structure based on the PostgreSQL extension HStore. This binary structure allows the implementation of new features that make PostgreSQL JSON much more useful: indexing, transformation, sorting, path search, and matching keys and values. It also means better performance on some operations.

Version 9.2 introduced JSON text support to PostgreSQL. However, that data was stored as text, which had several disadvantages. All operations had to re-parse the JSON, even purely internal ones. It was also impossible to sort JSON values, or to determine the equivalence of two JSON strings since order and white space were preserved. For backward compatibility reasons, however, the original text JSON type will be retained in PostgreSQL, which requires users to choose between the two types. The old JSON type will continue to be useful for users who need to preserve white space and key ordering.

Indexing JSONB

The biggest feature advantage of JSONB is the ability to create general indexes on columns of JSONB values, which can then be used for almost any search on that data. These indexes use GIN (Generic INverted) indexes, which have also received a substantial (50% or more) performance boost for 9.4. As an example, imagine that we have this table of publication data for a set of books:

    Table "public.booksdata"
      Column  |  Type
    ----------+--------
     title    | citext
     isbn     | isbn
     pubinfo  | jsonb

We can create a GIN index on the bookdata column:

    CREATE INDEX booksdex ON booksdata USING gin_hash_ops (pubinfo);

And then do path queries and matching against that data, which will use the index:

    SELECT title, isbn, pubinfo #> '{ "whlscost" }' as cost
    FROM booksdata
    WHERE pubinfo @> '{ "publisher" : "Avon" }';

As with other exotic data types, the new features are supported by a mix of functions and operators which are available inside the SQL interface. For example, "@>" used above means "contains", and "#>" means "extract path". In this case, a "path" is a chain of hierarchical keys, such as "publisher, format, edition". Path queries will become even more useful in PostgreSQL 9.5, when wildcards will be supported.

The new type also supports doing element extraction, comparisons, and sorting, which allows treating JSON values like other kinds of data values. For example, given a set of heterogeneous JSON strings, we can get them to sort in a way which makes intuitive sense:

    SELECT some_json FROM sortjson
    ORDER BY some_json;
	   some_json
    ------------------------
    {"a": 1, "b": "carol"}
    {"a": 1, "b": "mel"}
    {"a": 3, "b": "alice"}
    {"b": "alice", "c": 2}

Version 9.4 also introduces several aggregation functions, which let you "roll up" JSON from an entire column:

    SELECT title,
      json_agg(pubinfo #>> '{ "published_on" }') as pub_dates
    FROM booksdata
    WHERE title = 'Sphere' group by title;

     title  |           pub_dates
    --------+------------------------------
     Sphere | ["1999-03-04", "1998-07-04"]

This takes the publication date from a set of JSON documents, and aggregates them into a JSON array.

Utility and purpose of JSONB

There have been many questions inside and outside the PostgreSQL community (for example, on Reddit) as to why an advanced relational database is implementing functionality more associated with non-relational, or "NoSQL", databases. Some commenters, in particular, see JSONB as an abandonment of the relational model. So, why did the project implement it?

The first answer is that, historically, the PostgreSQL project has continually added new data types to support new kinds of data which need storage, indexing, and manipulation. Over the last 16 years, this has included IP address types, arrays and matrices, spatial data types, XML data, and others. So adding support for storing data in JSON format is just a continuation of that.

Perhaps more importantly, JSON supports having the application add new attributes to the data at runtime. One of the limitations of the SQL model has long been the lack of a safe way for a user application to add columns in response to a user action, such as data from a configuration control panel. PostgreSQL 8.3 previously bridged this gap with the HStore indexed key-value data type. JSONB is a furtherance of that concept that supports not only key-value data but also hierarchical data.

Existing PostgreSQL users also wanted richer JSON features to support their use of JSON as an output format and an API for web applications that interact with the database. This allows construction of simple two- or three-tier web and mobile applications that get back data as JSON. This data can then be passed through directly to a JavaScript client without further manipulation. In particular, the new operators and functions that come with JSONB will enable building more of these kinds of applications, faster.

Comparisons to NoSQL

Of course, the other big reason to implement indexed JSON storage inside PostgreSQL is competition with new, non-relational databases — the so-called NoSQL databases. Many of these databases use JSON as a storage or API format, which is appealing to web developers who are already familiar with it.

With JSONB, PostgreSQL now implements a large chunk of the JSON data manipulation features available in databases like MongoDB and CouchDB, with comparable performance. This is all implemented without sacrificing the database's relational capabilities, reliability, or multi-core performance. The project is betting that it can lure back users who like JSON but are unhappy with some of the limitations of the new databases.

The PostgreSQL project will need to implement more to become really attractive for use as a NoSQL database, though. A fast API that doesn't require SQL would be the first step, and projects like pgRest, Mongolike, and Mongres have all been working on this. A second requirement is much more difficult: sharding and horizontal scalability; this may possibly be addressed by the recently-released PostgresXL.

Other features remain on the "to do" list for PostgreSQL's JSON support that the developers plan to address, including: offering ways to update a single key in a large JSON document, wildcard support for path queries, and faster and more versatile indexing. That last item will be addressed as part of the VODKA project (yes, really), which will be launched at pgCon this week. One only wonders how long it will take PostgreSQL to support WHISKEY and RUM indexing as well.

Data Change Streaming

PostgreSQL's built-in replication has proved to be extremely popular and robust, allowing many users to improve redundancy and scale. However, it is limited to replicating the entire database in only one direction, which constrains the kind of scale-out architectures users can build with it. It also requires both master and replica to be running the same PostgreSQL version, preventing upgrade-by-replication.

There have been other systems to work around this, such as Slony-I, Bucardo, and Londiste. However, all of these limit both throughput and the changes application developers are allowed to make to the database. As a result, they have been unattractive for large scale-out infrastructures.

Data Change Streaming, a 9.4 feature added by 2nd Quadrant developer Andres Freund, provides a C API for listening to PostgreSQL's binary replication stream and extracting row changes and SQL statements from it. This feature has also been called "Changeset Extraction" and "Logical Decoding" at various points. This release also includes pg_recvlogical, a command-line utility that connects to PostgreSQL replication and writes out data changes to STDOUT or a file.

Data Change Streaming can become a game-changing feature for PostgreSQL, allowing development of sophisticated multi-directional replication (otherwise known as "multi-master"), and automated sharding systems, without sacrificing per-node performance. That will require the construction of an entire layer of tools on top of Data Change Streaming before most users are able to utilize it. The Slony-I project is already writing code to use the new API.

The biggest "to do" for the next version is to find a way to capture Data Definition Language (DDL) statements, which are the commands the user sends to create and modify table designs that are not represented in the data change stream in 9.4. Fixing this is a high priority for 9.5, so that the replication system doesn't interfere with continuous integration and development pushes, which may include such statements.

Other Features

As with other PostgreSQL releases, this one contains a large number of other features. Among them are:

ALTER SYSTEM SET: this new statement allows setting PostgreSQL configuration variables in the configuration file over a database connection. This enables easier auto-tuning and management of large numbers of PostgreSQL instances.

REFRESH MATERIALIZED VIEW CONCURRENTLY: this allows users to append or rebuild "materialized views" in the background while other users are still querying the old version. As materialized views are large, complex reporting queries whose results have been stored for quick reference, this will make PostgreSQL more useful as an analytics and decision support database.

Dynamic background workers: version 9.3 introduced the idea of the "background worker", a daemon that would start and stop with PostgreSQL and handle background tasks. Now, 9.4 adds the ability to make these workers dynamically loadable, which means that they can be launched in response to server tasks, permitting asynchronous activity, parallelism, and deferred maintenance.

This is just the first beta release, but the list of features is expected to be stable between this and the final release. Multiple betas will be released over the next four months, culminating in a final release sometime in September. Development on PostgreSQL version 9.5 will start in June.

[ Josh Berkus is a member of the PostgreSQL core team. ]

Comments (25 posted)

US Supreme Court decisions make patent trolling riskier

May 21, 2014

This article was contributed by Adam Saunders

While the recent decision for Oracle on the copyrighting of APIs may be distressing to software developers, the Supreme Court of the US (SCOTUS) offered some comfort on a different issue a few weeks ago. The court dealt a significant blow to patent trolling.

Exhibiting an awareness of frivolous litigation plaguing the patent system, SCOTUS chose to hear oral arguments in two cases — Octane v. ICON, and Highmark v. Allcare — that focused on the awarding of legal fees for victorious defendants of weak-to-completely-baseless lawsuits for patent infringement. We looked at the cases in March.

Toward the end of April, SCOTUS made two 9–0 rulings in these cases. And these rulings will likely deter much frivolous patent litigation because they effectively create a threat of major financial loss to an unsuccessful plaintiff. For example, suppose a troll's business model relies on getting settlements of several tens of thousands of dollars from numerous defendants. If someone it threatens stands up to it, and gets a judgment that includes hundreds of thousands of dollars in lawyer's fees, that could make a major impact on the troll.

Writing for a unanimous court, Justice Sotomayor found [PDF] in favor of Octane in Octane v. ICON. The issue in the case was the standard by which a "court in exceptional cases may award reasonable attorney fees to the prevailing party." Sotomayor began by tracing the history of the rules for attorney's fee awards in patent litigation. The most recent change to the rules, Section 285 of the Patent Act, essentially inserted two words — "exceptional cases" — into those rules. Sotomayor noted that SCOTUS had previously ruled that those two words merely clarify the rules.

Following the addition of an appellate court for all patent matters in the US — the Court of Appeals for the Federal Circuit (CAFC) — in 1982, the status quo was largely upheld for over twenty years; that is, "the Federal Circuit [...] instructed district courts to consider the totality of the circumstances when making fee determinations under §285" (page 5). But when the CAFC came across a particular case nine years ago — Brooks Furniture v. Dutailier — it decided on its own to implement a new standard: a defendant could only get attorney's fees if the lawsuit was done "in subjective bad faith and [...] [it was] objectively baseless" (page 8).

The Supreme Court has now cast aside that restrictive standard. After looking at dictionary definitions of "exceptional", SCOTUS decided on this standard (pages 7–8):

That an 'exceptional' case is simply one that stands out from others with respect to the substantive strength of a party's litigating position (considering both the governing law and the facts of the case) or the unreasonable manner in which the case was litigated. District courts may determine whether a case is 'exceptional' in the case-by-case exercise of their discretion, considering the totality of the circumstances.

Accordingly, SCOTUS reversed the CAFC's ruling, and ordered that the case go back to the lower court to resolve the attorney's fee question by following the new standard SCOTUS had established.

As the other case, Highmark v. Allcare, also dealt with attorney's fees and Section 285, the ruling [PDF] was short. The particular issue in the case was how much deference appeals courts should give to district courts that award attorney's fees in patent infringement cases. If higher courts must defer to lower courts on these rulings, it could deter frivolous litigators, because it closes off an avenue for them to keep the threat of a lawsuit alive.

The CAFC ruled that no deference should be awarded to the lower courts. Speaking again for a unanimous SCOTUS, Sotomayor reversed this decision in light of the Octane ruling: "Because §285 commits the determination whether a case is 'exceptional' to the discretion of the district court, that decision is to be reviewed on appeal for abuse of discretion" (page 4). This effectively means that a higher court reversing an award of attorney's fees will become quite uncommon. As with Octane, SCOTUS ordered the case back to the lower courts, and for those courts to apply the SCOTUS ruling.

These rulings have already sent some waves through the patent-litigation world. Several experienced patent litigators have expressed a belief that the anti-patent-troll bill before the US Senate will now likely die because it also centered around the issue of attorney's fee awards. Some have suggested that, with the broad discretion now clearly granted to them, district courts will feel much more confident in awarding attorney's fees to successful defendants of frivolous patent litigation. Kristen Fries, an experienced patent attorney, stated on the popular patent blog "Anticipate This" that the rulings "may aid in thwarting certain 'patent trolls' from asserting patent claims that are meritless or brought in bad faith."

The US software industry may be able to relax a little. After these rulings, some potential malicious litigators may need to rethink their strategy. That could lead to fewer weak patent suits, which would at least be a step in the right direction.

Comments (17 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Security: XMPP switches on mandatory encryption; New vulnerabilities in chromium, kernel, mcrypt, qemu, ...
  • Kernel: Tux3; 2038 is closer than it seems; Eudyptula Challenge; BPF.
  • Distributions: Fedora mulls providing a local DNSSEC resolver; Ubuntu 12.10 eol, ...
  • Development: A look at Qt 5.3; Wayland/Weston 1.5.0; Introducing GtkInspector; The desktop and the developer; ...
  • Announcements: FCC votes for Internet “fast lanes”, interview with Linus Torvalds, events.
Next page: Security>>

Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds