
MongoDB: leave your SQL at home

April 14, 2010

This article was contributed by Nathan Willis

MongoDB is an open source document-oriented database system that is designed for speed and scalability in web site data operations, bridging the gap between simple "key/value" structured storage and the heavyweight requirements of relational database systems. Like other databases in the so-called "NoSQL" vein, MongoDB trades in full ACID compliance for the ability to solve a smaller set of problems easily and quickly.

MongoDB theory

MongoDB's data sets are called "collections" and are roughly analogous to the tables in a traditional relational database. Unlike relational database tables, however, they have no predefined structure (or schema, to use the canonical term) — each record in the collection is a "document" that can potentially have a different structure than every other document in the collection.

This is not to say that MongoDB documents are unstructured, of course; they use a key-value pair syntax modeled on the popular JavaScript Object Notation (JSON) format. MongoDB calls this syntax BSON (alternately expanded as "Binary JSON" and "Binary Serialized dOcument Notation"), and it is designed to be easily traversed, easily coded-to, and lightweight — enough so that it is also MongoDB's network transfer format. Document keys are strings, and values can be a variety of types including strings, arrays, and even other documents.

For example, a JSON object such as

    {
        "firstName": "Nathan",
        "lastName": "Willis",
        "Url": "http://www.freesoftwhere.org"
    }
would appear quite simply as the document:
    {"firstName" : "Nathan" , "lastName" : "Willis" , "Url": "http://www.freesoftwhere.org" , \
     "_id" : ObjectId(497cf6075172cf775cace8fb)} 
in a MongoDB collection.

MongoDB's query language is also based on the BSON syntax, so data can be fetched with simple expressions such as db.users.find({'lastName': 'Willis'}) or sorted with db.users.find({}).sort({lastName: 1}). All of MongoDB's queries are dynamic, however, meaning that clients can query the database on any key, without first having to calculate a "view" that indexes the data based on a particular key. This is different from other document-oriented databases, such as CouchDB, which can perform only static queries.
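
For readers new to the shell syntax, a minimal session against the users collection from the example above might look like the following sketch; the index call at the end is optional and included only to show that indexes can be added after the fact:

    // insert a document, then query it back by any key
    db.users.insert({"firstName": "Nathan", "lastName": "Willis",
                     "Url": "http://www.freesoftwhere.org"});
    db.users.find({"lastName": "Willis"});          // no pre-built view is required
    db.users.find().sort({"lastName": 1}).limit(5); // cursors support chained sort and limit
    db.users.ensureIndex({"lastName": 1});          // add a secondary index if the query is hot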

The conceptual differences between MongoDB's schema-free documents and a traditional relational database produce some limitations, but also enable some real-world speed optimizations. Developer Richard Kreuter described MongoDB in a talk at Texas Linux Fest on April 10. He said that because documents are schema-free, the database can be designed to store information commonly accessed in a serial fashion within a single document — for example, a blog post's content and all of the reply comments. 99 percent of the time, he said, they will be retrieved in precisely that order. By not storing the post, user names, and comments in separate tables, access is substantially sped up. The only cost is loss of the comparatively-infrequently-needed ability to atomically update the post and the comments simultaneously from different database clients.
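
As a rough sketch of that layout (the collection and field values here are invented for illustration), a post and all of its comments can live in a single document, and a new comment can be appended in place with an atomic $push:

    // one document holds the post body and all of its comments
    db.posts.insert({
        "title"    : "MongoDB at Texas Linux Fest",
        "author"   : "nwillis",
        "body"     : "...",
        "comments" : [ {"user": "reader1", "text": "Interesting."} ]
    });

    // appending a comment touches only this one document
    db.posts.update({"title": "MongoDB at Texas Linux Fest"},
                    {"$push": {"comments": {"user": "reader2", "text": "Agreed."}}});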

The project lists web site content management, real-time analytics, caching, and logging as ideal use cases for MongoDB. Highly transactional systems, on the other hand, are a poor fit, as the MongoDB server can enforce transactionality only on operations that touch a single document.

In addition to its overall document-centric design, MongoDB also offers several interesting features that database application developers are likely to find convenient. One example is the "upsert" operation, which updates an object in a database document if the object already exists, and creates it if it does not exist. Another example is "capped collections," in which a collection is created with a fixed size, and the oldest entries are automatically removed. Capped collections preserve insertion order automatically and free the developer from having to manually "age-out" the oldest objects by tracking their timestamps.
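
Two short sketches of those features in the mongo shell (the collection names are invented for illustration); in the 1.4-era API an upsert is requested by passing true as the third argument to update():

    // upsert: increment a counter, creating the document on first use
    db.counters.update({"name": "hits"}, {"$inc": {"count": 1}}, true);

    // capped collection: fixed size in bytes, the oldest entries fall off automatically
    db.createCollection("recent_log", {"capped": true, "size": 1048576});
    db.recent_log.insert({"ts": new Date(), "msg": "server started"});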

MongoDB deployment and administration

MongoDB is developed primarily at 10gen, a company which offers commercial support contracts and training for MongoDB administration and development. The latest release is version 1.4, from March 22, 2010, and is licensed under version 3 of the AGPL. The project provides packages for 32-bit and 64-bit versions of x86 Linux, Solaris, Windows, and Mac OS X, as well as an APT repository for Debian and Ubuntu.

The main MongoDB server runs as the mongod process. Packages include a shell interpreter interface called mongo, which uses JavaScript as its command language — most of the documentation and tutorials on the Mongo web site use this interface for their examples. Language drivers are available for C, C++, Python, Java, and Perl clients in the official packages, and C#, REST, ColdFusion, Ruby, PHP, JavaScript, and several others in community-supported add-ons.
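
A brief interactive session (the database and collection names are only examples) reads like ordinary JavaScript:

    // started with something like: mongo mydb
    db.getCollectionNames();   // list the collections in the current database
    db.users.count();          // count the documents in one collection
    for (var i = 0; i < 3; i++) { db.users.insert({"n": i}); }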

Mongo supports several replication configurations, including the usual master-slave, as well as "replica sets" that automatically negotiate which database server functions as the master at a given point in time. Master-master replication is supported only in a limited fashion.
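
As a sketch of the conventional master-slave arrangement (host names, ports, and paths are placeholders), the roles are chosen with command-line flags when mongod starts:

    # on the master host
    mongod --master --dbpath /data/db

    # on the slave host, pointing back at the master
    mongod --slave --source master.example.com:27017 --dbpath /data/db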

Mongo is designed to be highly horizontally scalable, supporting database cluster functionality like failover, map/reduce, and sharding. The current release supports auto-sharding, in which a routing process called mongos interacts with the client in order to abstract away the actual cluster of mongod servers.
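
A rough sketch of turning on auto-sharding from the shell, issued against the admin database through a mongos router (the database, collection, and shard-key names are placeholders):

    // connect to mongos and switch to the admin database
    use admin
    db.runCommand({"enablesharding" : "mydb"});
    db.runCommand({"shardcollection" : "mydb.users", "key" : {"lastName": 1}});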

Because Mongo does not support transactions in the sense that relational databases support them, it does not support transaction logs that enable database repair — the only real protections against data loss are backups and replication. One other feature worth noting is that the current releases of Mongo only support username-and-password authentication that grants read-write or read-only access to a particular database. Deployments that need stronger security or more fine-grained access control may not find Mongo a good fit.
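
A sketch of that access-control model in the shell (account names and passwords are placeholders); the server must also be started with the --auth flag before the accounts are enforced:

    // run against the database being protected
    db.addUser("writer", "s3cret");          // read-write access to this database
    db.addUser("reporter", "s3cret", true);  // third argument requests read-only access
    db.auth("reporter", "s3cret");           // authenticate an existing client connection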

Still, there are plenty of large-scale production MongoDB servers in the wild — most notably the web, project, and download pages on SourceForge.net, the GitHub service, and the Disqus blog-discussion system. Those examples and the others listed on the Mongo "production deployments" page all seem to fit broadly into the problem space that Mongo is optimized for: "high-volume, low-value data" web sites, which have little need for the transactional requirements that a relational database system like MySQL provides. If your site also fits the pattern, MongoDB deserves a close look.



Speeding up by dropping support for some transactional updates

Posted Apr 15, 2010 12:42 UTC (Thu) by epa (subscriber, #39769) [Link]

By not storing the post, user names, and comments in separate tables, access is substantially sped up. The only cost is loss of the comparatively-infrequently-needed ability to atomically update the post and the comments simultaneously from different database clients.
This is interesting. I wonder if the same kind of optimization could be applied to SQL databases, by declaring somehow in the schema definition that it won't be necessary to update individual rows atomically (or at least that this is a very rare operation, and need not be fast). Perhaps the DBMS could even analyse a log of the commonly executed queries and optimize its table storage and locking to make those go fast.

Speeding up by dropping support for some transactional updates

Posted Apr 15, 2010 19:09 UTC (Thu) by hathawsh (guest, #11289) [Link]

MySQL lets you mix different storage engines within a schema, giving the effect you're describing. RelStorage, a ZODB backend that stores Python objects as pickles, uses this feature when storing in MySQL. It helps performance significantly.

Speeding up by dropping support for some transactional updates

Posted Apr 17, 2010 3:23 UTC (Sat) by dmag (subscriber, #17775) [Link]

> stores Python objects as pickles

That's more like a key-value store. One big problem: For any operation on an 'object', you must download the entire object. (I.e. let's say each object is a blog post with metadata and all comments. You would have to download all comments for all blog posts just to list the titles of the blog posts. Ick.)

With a docstore, you can have the database scan all the objects and pick out the field you want. It saves a ton of bandwidth, and client-side processing. (Mongo saves a lot of server-side processing by using BSON for disk storage and wire-protocol.)
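
For example (the collection and field names are hypothetical), the second argument to find() tells the server which fields to return, so listing post titles does not require shipping the embedded comments back to the client:

    // return only the title field (plus _id) from every document
    db.posts.find({}, {"title": 1});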

MongoDB: leave your SQL at home

Posted Apr 16, 2010 16:52 UTC (Fri) by zooko (subscriber, #2589) [Link]

I don't understand how MongoDB supports "replica sets", in which it automatically negotiates which server is master, and "fail-over", if it doesn't support "master-master replication". That is: if the current master fails, and a new master is automatically chosen to take over, then how can it do that if it doesn't have a replica (at least *mostly* up to date) of what the old master had?

MongoDB: leave your SQL at home

Posted Apr 17, 2010 3:45 UTC (Sat) by dmag (subscriber, #17775) [Link]

> I don't understand how MongoDB supports "replica sets", in which it automatically negotiates which server is master, and "fail-over", if it doesn't support "master-master replication".

Those are different things:

In master-master replication, either side can take writes. This is the CouchDB model. The problem is that you get more "eventual" in your eventual consistency. (i.e. you might not read your writes: You can write to one system and read from the other.)

In a replica set, MongoDB allows you to set up 2 "masters", but they flip a coin and one becomes the slave. If the master dies, the slave takes over. Since there's only one master at a time, the system is consistent (all R/W to the master.)

The setup works out to be a lot like MySQL (master/slave with async replication), but with less work. MySQL forces you to micro-manage each box to tell it when it's a master or slave. In fact, Mongo in general feels like a regular database, but without the schema.

I also like the MongoDB philosophy of not worrying about disk. Who wants to wait for their server to reboot and do fsck and then hope that the rarely-used database repair code actually works?

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds