In Berkeley, California — the birthplace of PostgreSQL — it's spring: plum
and cherry blossoms, courting finches and college students, new plans for
the summer, and the first beta release of the database
system. Every year, the first beta of the next PostgreSQL version comes out
in April or May, for a final release in September. PostgreSQL
9.3 beta 1 was released to the public on May 13th, and contains a
couple dozen new features both for database administrators and application
Of course, if you're in the southern hemisphere, it's not spring for
you. Nor if you live in the north Midwest of the US or central Canada. Sorry about that; we just track the weather with PostgreSQL, we don't control it.
As usual, there are too many features in the new PostgreSQL to cover them
all individually, so we're going to go over just a few of them here:
database federation, writable foreign data sources, and the end of System V shared memory dependence.
Sharding and master-slave replication are the most common ways to horizontally scale a relational database, but they are not the only ones. A type of horizontal scaling which has been little explored among open source databases is federated databases. PostgreSQL 9.3 will introduce database federation as a standard feature of the database system, through the postgres_fdw extension, which will ship with the core code.
The idea of federated databases is that each database server can query other database servers on the network, effectively giving the user access to all databases and tables on all servers from any single database client. Tables can even be JOINed in queries with other tables from other servers. Federated databases are basically connected by the database engine at the query executor level. This type of horizontal scaling is mainly good for spreading reads of large amounts of data around several machines, and as such is primarily useful for data warehousing and analytics, rather than for transaction processing.
This concept has been implemented before, and is one of the most
distinctive features of IBM's DB2. However, implementations within open
source relational databases have not been extensively used. PostgreSQL has
had federated databases though Skype's PL/Proxy
extension since 2006, but that extension was never integrated into the
core, and it forces database access via stored procedures, which didn't
work for many users. MySQL introduced the FEDERATED
storage engine in version 5.0, but for some reason it seems not to have
been widely adopted as part of scalable MySQL solutions.
PostgreSQL introduced the foreign
data wrapper (FDW) feature for accessing external data at the query
level in version 9.1. That version's FDWs were read-only, however, and
queried external data by copying the entire target data source to memory on
the querying server, which necessarily limits how big of an external table
you could query. Since August of 2010, a development team
including major contributor Shigeru Hanada and committer Takahiro Itagaki,
worked on making FDWs something more, culminating in the new postgres_fdw
extension. Hanada described the use case and development of the feature as
I want to allow creating a cluster which contains multiple PG databases connected loosely with postgres_fdw.
Sometimes users want to have partial remote data available on the other database, but Streaming Replication
doesn't allow writable stand-by, and contrib/dblink is not easy for application developers. postgres_fdw
provides us with a nearly realtime view of remote database(s).
Creating a link to a table on another PostgreSQL server is relatively
simple, although it has multiple steps. First, import the postgres_fdw
CREATE EXTENSION postgres_fdw;
Then create a server-to-server link:
CREATE SERVER remotesrv foreign data wrapper postgres_fdw OPTIONS
( host '127.0.01', port '5433', dbname 'bench');
Create a mapping
for local database users to remote database users:
CREATE USER MAPPING FOR current_user SERVER remotesrv OPTIONS
( user 'josh', password 'password' );
Finally, link to the
tables on the remote server:
CREATE FOREIGN TABLE remoteacct (aid int, bid int, abalance int, filler char(84))
SERVER remotesrv OPTIONS ( table_name 'pgbench_accounts' );
Queries can be run against the tables on the attached server, including writing to it, just as if it were a table on the local server:
EXPLAIN SELECT * FROM remoteacct WHERE bid = 5;
INSERT INTO remoteacct ( aid, bid, abalance ) VALUES ( 10000000, 5, 100 );
The postgres_fdw extension permits the PostgreSQL query planner to "push
down" query clauses, such as WHERE clauses, to the remote database,
allowing for distributed, parallel execution of several portions of the
federated query. This eliminates the requirement to copy the entire
foreign table into memory, and with some architectures permits execution of
federated queries over much larger data sets than could be queried on one
server alone. More work needs to be done in this area to make this a full
"big data" solution, however; Hanada hopes to add "push down" of joins, sorts, and aggregates in future versions of PostgreSQL.
Linking PostgreSQL to Redis
The improved foreign data wrappers aren't limited to
PostgreSQL-to-PostgreSQL connections. FDWs can be used to connect
PostgreSQL to any kind of external data, as long as it can be rendered in
a tabular format: Oracle database tables, CSV files, process lists, or data
from non-relational databases such as Redis
and MongoDB, are all possibilities. Now that FDWs are writable as well as readable, PostgreSQL becomes much more useful as a data integration tool for multiple other databases and data sources.
In order to connect to a non-PostgreSQL data source, a developer needs
to write a special driver (also called an FDW), which is then installable
as a PostgreSQL extension. Existing FDWs will need to be enhanced to
support writability. One which is likely to be ready for read-write access
when PostgreSQL 9.3 is released is the Redis FDW. Andrew Dunstan,
PostgreSQL committer, is working on a new version of the driver, because he uses Redis together with PostgreSQL. Redis is a non-relational, memory-only database which stores hashes, lists, and sets.
Many people use Redis alongside PostgreSQL. It functions well as an
application cache, a buffer for rapid-fire writes, or to support queuing.
The Redis FDW allows users to pull data directly from Redis into PostgreSQL
without going through the application layer, saving both development time
and system resources. Dunstan describes the work:
Redis's response times are extremely fast. It isn't just a simple key value store. The values can be structured.
That makes it easier to fit them to a PostgreSQL table structure. In particular, a Redis hash is a good fit for
a row on a PostgreSQL table, and a Redis set is useful as more or less a table index, and via the keyset structure
a substitute for a "where" clause.
The Redis FDW works by mapping each of the Redis data types (scalar, hash, set, list, ordered set) into PostgreSQL
tables. Tables can be specified to occupy a subset of the global Redis keyspace, either by
the keys having a specified prefix, or by designating that the keys are stored in a named Redis set.
Redis can be attached as a foreign server as well. First you need to
install the redis_fdw extension, and create the foreign server, in this
case a Redis server on the same machine, and tell PostgreSQL what users
are allowed to access it, as you do with a Postgres-to-Postgres link:
CREATE EXTENSION redis_fdw;
CREATE SERVER localredis FOREIGN DATA WRAPPER redis_fdw;
CREATE USER MAPPING FOR public SERVER localredis;
Redis stores data in five forms: scalars, hashes, sets, lists and sorted
sets (zsets). These objects can be mapped to local PostgreSQL tables,
either singly or in groups. The code below, for example, makes a two-column
table out of all of the lists whose key begins with "jobqueue":
CREATE FOREIGN TABLE jobqueues(key text, list text)
OPTIONS (database '0', tabletype 'list', tableprefix 'jobqueue');
The future read-write Redis FDW will permit pushing data to Redis as well as reading it, enabling new ways of using Redis as an integrated cache or queuing store with PostgreSQL. For example, cache invalidation and refresh can be controlled using a database trigger, making invalidation much more discriminating about which data it needs to replace. Or PostgreSQL can automatically push items onto a Redis queue whenever certain events in the database happen.
No more "shmmax"
PostgreSQL's use of SysV shared memory has been a frequent source of irritation for system administrators. Every time an administrator installs PostgreSQL on a brand-new system, they have to edit the sysctl.conf file in order to raise the kernel limits on shared memory by substantially increasing "shmmax" and "shmall"; otherwise, PostgreSQL is limited to using a few megabytes of RAM. As of version 9.3, this headache will go away, thanks to committer Robert Haas:
The changes were isolated to just one file, sysv_shmem.c. I wrote the patch in just one day. The only real
hard part was figuring out what shared memory facilities existed that would be portable enough to serve our needs.
We haven't completely given up SysV RAM. PostgreSQL 9.3 will still allocate a small region of System V shared
memory, just a few bytes. This region is important because it allows us to protect against the catastrophic
situation where two unrelated sets of PostgreSQL processes access the same data directory at the same time.
This interlock relies on a feature of System V shared memory that does not seem to be offered by any other
commonly-available shared memory facility: the ability to determine the number of other processes which are
attached to a given shared memory segment. Until someone devises another, equally bullet-proof way of doing
this, we'll probably continue to rely on SysV shared memory at least for this small task.
After debating several other approaches, Haas ended up moving all of
PostgreSQL's dedicated memory to mmap(), using anonymous shared
memory regions. This avoids shared memory limits, and is a widely
supported interface that works on all of the operating systems supported by
PostgreSQL, except for Windows. PostgreSQL for Windows, however, has always
used a different system of memory allocation. The new mechanism works
without users needing to make any changes to their PostgreSQL
configurations. The one problem reported so far has been decreased
performance on BSD operating systems, due to the loss of special
optimizations those operating systems made for SysV shared memory.
Lots more features
As always, this PostgreSQL annual release has a lot more features to offer users. Included in the current beta are:
- A new CREATE MATERIALIZED VIEW declaration, which lets users generate precalculated summary tables using a simple SQL statement.
- Regular dynamic VIEWs (saved queries) are now automatically updatable as well as readable.
- A data page checksum option for users who need to detect disk integrity issues immediately.
- A new JOIN construct, LATERAL JOINs, which allow self-referencing table
lists; this is especially useful with functions or foreign data wrappers.
- Additional built-in JSON manipulation functions for the JSON data type.
- Indexed regular expression search to speed up text matching.
- Sub-second "fast failover" option when switching replicated servers, for 99.999% high availability.
Plus many other features, some of them unique to PostgreSQL. The community has documented some of them, so that you can see if there is something for you in this beta.
Inevitably, these features also introduce lots of new bugs, which is why
the PostgreSQL team wants you to download the beta and test it with your
applications. Like most community-driven open source projects, PostgreSQL
relies heavily on end-user testing to find bugs before the final release. The project will release multiple betas over the four month testing period. The final release for version 9.3 is expected to be sometime in September.
[ Josh Berkus is a member of the PostgreSQL Core Team. ]
to post comments)