April 14, 2010
This article was contributed by Nathan Willis
MongoDB is an open source document-oriented database system that is designed for speed and scalability in web site data operations, bridging the gap between simple "key/value" structured storage and the heavyweight requirements of relational database systems. Like other databases in the so-called "NoSQL" vein, MongoDB trades in full ACID compliance for the ability to solve a smaller set of problems easily and quickly.
MongoDB theory
MongoDB's data sets are called "collections" and are roughly analogous to the tables in a traditional relational database. Unlike relational database tables, however, they have no predefined structure (or schema, to use the canonical term) — each record in the collection is a "document" that can potentially have a different structure than every other document in the collection.
This is not to say that MongoDB documents are unstructured, of course; they use a key-value pair syntax modeled on the popular JavaScript Object Notation (JSON) format. MongoDB calls this syntax BSON (alternately expanded as "Binary JSON" and "Binary Serialized dOcument Notation"), and it is designed to be easily traversed, easily coded-to, and lightweight, enough so that it is also MongoDB's network transfer format. Document keys are strings, and values can be a variety of types including strings, arrays, and even other documents.
For example, a JSON object such as
{
"firstName": "Nathan",
"lastName": "Willis",
"Url": "http://www.freesoftwhere.org"
}
would appear quite simply as the document:
{"firstName" : "Nathan" , "lastName" : "Willis" , "Url": "http://www.freesoftwhere.org" , \
"_id" : ObjectId(497cf6075172cf775cace8fb)}
in a MongoDB collection.
MongoDB's query language is also based on the BSON syntax, so data can be fetched with simple expressions such as db.users.find({'lastName': 'Willis'}) or sorted with db.users.find({}).sort({lastName: 1}). All of MongoDB's queries are dynamic, meaning that clients can query the database on any key without first having to calculate a "view" that indexes the data based on that key. This is different from other document-oriented databases, such as CouchDB, which can perform only static queries.
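The idea of querying on any key can be illustrated without a server. What follows is a minimal sketch in plain JavaScript, not MongoDB's implementation: the find() and sortBy() helpers and the sample records other than the one shown earlier are made up for illustration, and only exact-match queries on top-level keys are modeled.

```javascript
// In-memory stand-in for a schema-free collection; the second and third
// documents are invented and deliberately have different shapes.
const users = [
  { firstName: "Nathan", lastName: "Willis", Url: "http://www.freesoftwhere.org" },
  { firstName: "Ada", lastName: "Example", role: "admin" },
  { lastName: "Sample" }
];

// Return every document whose fields equal all key/value pairs in `query`.
// An empty query matches every document.
function find(collection, query) {
  return collection.filter(doc =>
    Object.keys(query).every(key => doc[key] === query[key]));
}

// Sort a copy of the documents by a single key.
// A direction of 1 means ascending, -1 means descending.
function sortBy(docs, spec) {
  const [key, direction] = Object.entries(spec)[0];
  return [...docs].sort((a, b) => {
    if (a[key] === b[key]) return 0;
    return (a[key] < b[key] ? -1 : 1) * direction;
  });
}

const matches = find(users, { lastName: "Willis" });
const sorted = sortBy(find(users, {}), { lastName: 1 });
console.log(matches.length, sorted.map(d => d.lastName));
```

No key had to be declared or indexed ahead of time: a query on "role" works even though only one document carries that key.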
The conceptual differences between MongoDB's schema-free documents and a traditional relational database produce some limitations, but also enable some real-world speed optimizations. Developer Richard Kreuter described MongoDB in a talk at Texas Linux Fest on April 10. He said that because documents are schema-free, the database can be designed to store information commonly accessed in a serial fashion within a single document: for example, a blog post's content and all of the reply comments. Ninety-nine percent of the time, he said, they will be retrieved in precisely that order. Not storing the post, user names, and comments in separate tables speeds up access substantially. The only cost is the loss of the comparatively infrequently needed ability to atomically update the post and the comments simultaneously from different database clients.
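The blog example can be sketched as a single nested document. The code below is plain JavaScript working on an in-memory array rather than a real database, and the field names (title, body, comments, user, text) and helper functions are assumptions chosen for illustration, not a layout taken from the talk.

```javascript
// Hypothetical blog post stored as one document: the post body and its
// reply comments live together, so a single lookup returns all of them.
const posts = [
  {
    _id: 1,
    title: "Example post",
    body: "Post text goes here.",
    comments: [
      { user: "alice", text: "First reply" },
      { user: "bob", text: "Second reply" }
    ]
  }
];

// Fetch the post and its comments, in stored order, with one lookup.
function loadPostWithComments(collection, id) {
  return collection.find(doc => doc._id === id) || null;
}

// Appending a comment touches only this one document.
function addComment(collection, id, comment) {
  const post = loadPostWithComments(collection, id);
  if (post === null) return false;
  post.comments.push(comment);
  return true;
}

addComment(posts, 1, { user: "carol", text: "Third reply" });
const page = loadPostWithComments(posts, 1);
console.log(page.title, page.comments.map(c => c.user));
```

A relational layout would need a posts table, a comments table, and a join to assemble the same page; here the page is simply the document.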
The project lists web site content management, real-time analytics, caching, and logging as ideal use cases for MongoDB. Highly transactional systems, on the other hand, are a poor fit, as the MongoDB server can enforce transactionality only on operations that touch a single document.
In addition to its overall document-centric design, MongoDB also offers several interesting features that database application developers are likely to find convenient. One example is the "upsert" operation, which updates an object in a database document if the object already exists, and creates it if it does not exist. Another example is "capped collections," in which a collection is created with a fixed size, and the oldest entries are automatically removed. Capped collections allow a collection to automatically retain order, but free the developer from having to manually "age-out" the oldest objects by tracking their timestamps.
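The semantics of both features can be modeled in a few lines of plain JavaScript. This is only a sketch of the behavior described above, not MongoDB's code: the upsert() function and CappedCollection class are hypothetical names, and the cap is expressed as a document count purely to keep the example short.

```javascript
// Update the first document matching `query`, or insert a new document
// built from the query plus the changes when nothing matches.
function upsert(collection, query, changes) {
  const existing = collection.find(doc =>
    Object.keys(query).every(key => doc[key] === query[key]));
  if (existing) {
    Object.assign(existing, changes);
    return "updated";
  }
  collection.push({ ...query, ...changes });
  return "inserted";
}

// Fixed-capacity collection that keeps insertion order and silently
// discards the oldest entries once the cap is exceeded.
class CappedCollection {
  constructor(maxDocs) {
    this.maxDocs = maxDocs; // assumption: cap counted in documents
    this.docs = [];
  }
  insert(doc) {
    this.docs.push(doc);
    while (this.docs.length > this.maxDocs) this.docs.shift();
  }
}

const counters = [];
const first = upsert(counters, { page: "/home" }, { hits: 1 });
const second = upsert(counters, { page: "/home" }, { hits: 2 });

const log = new CappedCollection(3);
for (let i = 1; i <= 5; i++) log.insert({ seq: i });

console.log(first, second, counters, log.docs);
```

The caller never checks whether the counter exists and never deletes old log entries by timestamp; both decisions are folded into the operation itself.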
MongoDB deployment and administration
MongoDB is developed primarily at 10gen, a company which offers commercial support contracts and training for MongoDB administration and development. The latest release is version 1.4, from March 22, 2010, and is licensed under the AGPL version 3. The project provides packages for 32-bit and 64-bit versions of x86 Linux, Solaris, Windows, and Mac OS X, as well as an Apt repository for Debian and Ubuntu.
The main MongoDB server runs as the mongod process. Packages include a shell interpreter interface called mongo, which uses JavaScript as its command language; most of the documentation and tutorials on the Mongo web site use this interface for their examples. Language drivers are available for C, C++, Python, Java, and Perl clients in the official packages, and C#, REST, ColdFusion, Ruby, PHP, JavaScript, and several others in community-supported add-ons.
Mongo supports several replication configurations, including the usual master-slave, as well as "replica sets" that automatically negotiate which database server functions as the master at a given point in time. Master-master replication is supported only in a limited fashion.
Mongo is designed to be highly horizontally scalable, supporting database cluster functionality like failover, map/reduce, and sharding. The current release supports auto-sharding, in which a routing process called mongos interacts with the client in order to abstract away the actual cluster of mongod servers.
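The routing idea can be sketched in plain JavaScript. This is a toy model and makes several assumptions that do not come from the description above: the shard map, the server names, the choice of splitting the key space into alphabetical ranges, and the routeTo() and routedInsert() helpers are all invented for illustration.

```javascript
// Hypothetical shard map: each entry owns a half-open range [min, max)
// of lowercased shard-key values. "{" sorts just after "z" in ASCII.
const shardMap = [
  { min: "a", max: "h", shard: "mongod-1" },
  { min: "h", max: "p", shard: "mongod-2" },
  { min: "p", max: "{", shard: "mongod-3" }
];

// Pick the shard whose range contains the given key value.
function routeTo(map, keyValue) {
  const entry = map.find(r => keyValue >= r.min && keyValue < r.max);
  if (!entry) throw new Error("no shard owns key " + keyValue);
  return entry.shard;
}

// The client hands the document to the router and never chooses a
// server itself; `storage` stands in for the cluster of servers.
function routedInsert(map, storage, doc, shardKey) {
  const shard = routeTo(map, String(doc[shardKey]).toLowerCase());
  (storage[shard] = storage[shard] || []).push(doc);
  return shard;
}

const storage = {};
routedInsert(shardMap, storage, { lastName: "Willis" }, "lastName");
routedInsert(shardMap, storage, { lastName: "Example" }, "lastName");
console.log(storage);
```

From the client's point of view there is one insert call and one database; which server actually holds the document is a detail of the map.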
Because Mongo does not support transactions in the sense that relational databases support them, it does not support transaction logs that enable database repair — the only real protections against data loss are backups and replication. One other feature worth noting is that the current releases of Mongo only support username-and-password authentication that grants read-write or read-only access to a particular database. Deployments that need stronger security or more fine-grained access control may not find Mongo a good fit.
Still, there are plenty of large-scale production MongoDB servers in the wild — most notably the web, project, and download pages on SourceForge.net, the GitHub service, and the Disqus blog-discussion-system. Those examples and the others listed on the Mongo "production deployments" page all seem to fit broadly into the problem space that Mongo is optimized for: "high-volume, low-value data" web sites, which have little need for the transactional requirements that a relational database system like MySQL provides. If your site also fits the pattern, MongoDB deserves a close look.
Brief items
Developing the release process is almost as hard as developing the code.
-- Keith Packard
Think Shakespearean and you can get an accurate count though:
notmuch count to be or not to be
OK, we need a simpler search syntax than that...
-- Carl Worth
Bricolage is a content management system aimed at organizations with large amounts of content; the 2.0 release has been announced. Changes include a reworked interface ("The amazingly-flexible Bricolage approach to document editing is now also amazingly easy to work with"), a number of backend improvements, and more. See the changelog for details.
Version 0.7.0 of the GNUmed medical records management package is out. It has a number of new features which are certainly unique to this type of software ("manage date of death per patient"), but the core feature seems to be "a rather unexpected new functionality" in the form of visual progress notes. See this posting for more information.
IcedTea is a Java development kit build done entirely with open source
tools. The 1.8 release is out; it includes an OpenJDK update, but the key
aspect of this release would appear to be the fixing of a discouragingly
large number of security issues.
The Perl 5.12.0 release is out; it marks the transition to a time-based
release process for Perl 5, where a major release will happen each
(northern-hemisphere) spring. Changes in this release include better
Unicode support, some new APIs to make it even easier to extend the
language, a solution to the Y2038 problem, a "yada yada operator," and more; see this page for a detailed list.
The Python development team has announced the first beta release of Python
2.7. Python 2.7 is likely to be the last major version in the 2.x series,
although more major releases have not been absolutely ruled out.
"2.7 includes many features that were first released in Python 3.1.
The faster io module, the new nested with statement syntax, improved float
repr, set literals, dictionary views, and the memoryview object have been
backported from 3.1. Other features include an ordered dictionary
implementation, unittests improvements, a new sysconfig module, and support
for ttk Tile in Tkinter."
A new major revision of the WebKit rendering engine has been posted by
Apple. "WebKit2 is designed from the ground up to support a split process model,
where the web content (JavaScript, HTML, layout, etc) lives in a separate
process. This model is similar to what Google Chrome offers, with the
major difference being that we have built the process split model directly
into the framework, allowing other clients to use it."
Unfortunately, it lacks a Linux port at the moment, but one assumes that
can be fixed.
Xen has released the Xen hypervisor 4.0.0. See the release notes for more information. "Xen 4.0 includes and builds the new pvops dom0 Linux 2.6.31.x kernel as a default. There's also long-term supported Linux 2.6.32.x based pvops dom0 kernel tree available. You can also use the old-style linux-2.6.18-xen as the dom0 kernel, or any of the various forward-ports of the 2.6.18 xen patches to newer kernels."
Newsletters and articles
Over at LinuxPlanet, Akkana Peck looks at Upstart, which is rapidly supplanting System V init for many distributions. "Upstart, in contrast, is event based. An 'event' can be something like 'booting' ... or it can be a lot more specific, like 'the network is ready to use now'. You can specify which scripts depend on which events. Anything that isn't waiting for an event can run whenever there's CPU available.
[...]
This event-based system has another advantage: you can theoretically use it even after the system is up and running. Upstart is eventually slated to take over tasks such as or plugging in external devices like thumb drives (currently handled by udev and hal), or running programs at specific times (currently handled by cron)."
Page editor: Jonathan Corbet