Traditional SQL databases with "ACID" properties (Atomicity, Consistency, Isolation and Durability) give strong guarantees about what happens when data is stored and retrieved. These guarantees make it easier for application developers, freeing them from thinking about exactly how the data is stored and indexed, or even which database is running. However, these guarantees come with a cost.
presented a talk titled "Drop ACID and think about data" at PyCon 2009, which gave an overview of number of non-traditional databases. These alternatives compromise one or more of the ACID properties and expose the particulars of that data store's implementation in exchange for improved performance or scalability. Each also has its own limitations. This article will look at the more mature open source options Ippolito mentioned.
A number of companies have developed their own in-house data stores, including Amazon's Dynamo and Google's Bigtable. While none of the open source options are exactly like Dynamo or Bigtable, there are a number of high-performance, reliable and scalable options available.
|Alternative Database Language Support
|* means language support is pending
||C++, C#, Java, Perl, Python, PHP, Erlang, Ruby
||C/C++, C#, Java, Perl, Python, PHP, Ruby, Lua, OCaml, Common LISP
||C/C++, Java, Perl, Ruby, Lua
||C/C++, Java*, Perl, Python, PHP, Erlang, Ruby, Lua, Tcl
||C++, Java, Python, PHP, Erlang*, Ruby
Cassandra is a data store written in Java that was open-sourced by Facebook and is now part of the Apache Incubator. Cassandra was originally designed to solve Facebook's in-box searching problem. Email reverse indexes were growing much faster than their databases could keep up with and they needed a affordable way to continue to grow.
Cassandra is designed to scale inexpensively with commodity networking equipment and servers, possibly in multiple data centers. Scalability and high availability are achieved by automatically "sharding" and replicating data across the servers and data centers.
A single Cassandra instance stores a single table, and each row is accessed with a key string. Every row of this table can have its own structure, storing a huge number of (key, value, time-stamp) tuples and/or nested columns. This makes Cassandra much more flexible than a simple key-value store, but not as general as a document database.
Although in heavy use by Facebook, Cassandra is early in development and still lacks some polish and documentation.
Perhaps the simplest key-value store is
Memcached. Memcached is widely used to to speed up web applications by caching dynamic content. Part or all of the web pages are served from the cache instead of generating them at each request. Unlike in-process or shared memory caches, Memcached listens on a network socket and can be shared by many servers. Memcached may also be run on multiple servers and it will spread the keys across those servers and transparently fall back to servers that are still available when one goes down.
Memcached keys and values are always strings. In addition to storing, retrieving and deleting values, it allows atomic appending/prepending string data to stored values and addition to/subtraction from 64-bit integer values stored as decimal strings.
Memcached's data store is a fixed size and resides entirely in-memory. Data may be stored with an expiration time. Memcached will actively throw out data when the cache is full or when the data is set to expire.
For a key-value data store that won't throw out data, Tokyo Cabinet is a good choice. Like Berkeley DB, it uses either a hash table, B+ tree or a array of fixed-length records to store data on disk, but Tokyo Cabinet performs better and is thread safe. Tokyo Cabinet also promises to never corrupt data even in a "catastrophic situation". Tokyo Cabinet is actively maintained, and data stored is not limited by system RAM.
Tokyo Cabinet clients are separated into readers and writers. When a writer is accessing the database all other clients are blocked. This will result in poor performance for write-heavy workloads.
Tokyo Cabinet supports appending data to values stored. When using a B+ tree layout Tokyo Cabinet provides a cursor object to efficiently move forward and backward through the keys. B+ tree pages may also be compressed on disk with zlib or bzip2. Compressing data not only saves disk space, but can also increase performance on I/O-bound systems.
Redis is a disk-backed, in-memory key-value store with a number of additional features. Redis supports master-slave replication for redundancy, but not sharding, so all data must fit in a single system's RAM. Redis values may be binary strings, lists or sets. Redis provides atomic addition to/subtraction from integer values stored as decimal strings and push/pop/replacement of values in lists. The intersection of set values stored may also be calculated.
Redis can asynchronously save the database on request by forking the server process and writing out data in the background. The last successful save time may be queried to check when the changes have made it to disk. This design allows for good performance with the ability to save data when it makes sense for the particular application, but the application is responsible to make sure data is properly saved.
When using any key-value store as a cache care must be taken to invalidate values when the data changes or inconsistency will be introduced. Choosing a memory-only cache will be faster once it is populated, but there is a cost associated with filling an empty cache on restart. Key-Value stores are ideal for storing data that is not deeply nested and does not require complicated locking.
Document databases are designed to store large blocks of semi-structured data. The data is not restricted to a particular schema, so new versions of data can be stored alongside old versions without the need for migrations. Documents can be very large and deeply nested.
CouchDB is a
document database written in Erlang. CouchDB gives access to the database over HTTP with a
CouchDB documents are stored with a sequence number and are never overwritten, this way partial writes will never result in data corruption. Readers are never blocked by writers and will always see a consistent snapshot of the database while reading data. The data is periodically compacted by writing out a new data file and deleting the original once it is no longer being accessed.
CouchDB uses peer based asynchronous replication. Documents may be updated on any peer, allowing for good write throughput. Conflicts will occur when two clients update the same document, and multiple conflicting documents may coexist in the database. A deterministic method is used to decide which document will be treated as the latest version. This lets CouchDB leave conflict resolution to the application. Once a conflict is resolved the new version is stored in the database as usual.
MongoDB is a document database written in C++.
MongoDB uses a binary-encoded JSON format that shrinks the data size and allows for faster searching and indexing. Large binary data, such as video files, can also be stored more efficiently in this format. Data is updated in place and MongoDB will automatically run a repair procedure on the database in the event of an unclean shutdown.
MongoDB documents may be nested or include references to other documents. References will be replaced with the value of the referenced document when queried. MongoDB supports persistent single or compound key indexes. Indexes are implemented as B-Trees and queries will automatically take advantage of all indexes available. Queries may include common conditional operators, membership testing and values in embedded documents.
MongoDB has auto-sharding support, splitting documents across many servers so that data stored is not limited by the capacity of a single server. MongoDB also supports asynchronous replication for high availability.
Choosing a Data Store
The best data store for an application depends in large part on how deeply nested the values stored will be. If the application needs to only store strings and integers then a simple key-value store like Memcached, Tokyo Cabinet or Redis would be best. If the values can be represented as lists and sets of simple values then Tokyo Cabinet, Redis or Cassandra would be good options. If the application needs nested lists and hashes then choose Cassandra, CouchDB or MongoDB. Finally, if the values contain deeply nested data then only a document database like CouchDB or MongoDB will do.
Once a data store has been chosen and and the application optimized for it, switching to a completely different API will not be easy. It is worth investing time evaluating the remaining options by writing code to simulate the application's usage patterns before making a choice.
to post comments)