Thanks to advances in storage and processing power, the affordability of servers, and above all the rapidly increasing number of software applications and monitoring tools businesses run these days, everyone seems to be drowning in data. Many system administrators, application developers, and junior database administrators are finding themselves doing "big data" — ready or not — without any familiarity with the terminology, ideas, or tools of data warehousing. Given that the field has been around for 40 years, there's quite a bit to know. This article will serve as a starting point for those wanting to learn the basics of data warehousing.
What is data warehousing?
Data warehousing isn't quite the same thing as "big data"; rather, it's a subset. Big data also includes systems which need to process millions of small requests per second across a large number of connections, such as Amazon's Dynamo database. Data warehousing generally refers to systems which accumulate data over large periods of time, and need to process a relatively small number of very large data requests.
Data warehousing really answers one of three different questions for organizations: archiving, data mining, or analytics. Sometimes it answers a combination of the three.
"I want to store large amounts of data for long
periods of time and might never query any of it."
Thanks to HIPAA, Sarbanes-Oxley, the 8th Company Law Directive, and other
regulations, this is an increasingly popular (or, at least, necessary) kind of data. It can also be
called "WORN" data, for Write Once Read Never. Companies accumulate large volumes of data which they don't really want, but can't throw away and theoretically need to be able to access in some reasonable amount of time. Storage sizes can range from gigabytes to terabytes.
A good example of this kind of system is the archive for European cell phone call completion data I helped build for Comptel and Sun Microsystems. With a projected size of 75 terabytes of data per city, each database was expected to answer less than one information request per week. This allowed us to build it using a very inexpensive combination of PostgreSQL, Solaris, and filesystem compression.
If you're building an archive, your only requirements are to minimize storage cost and to make sure the archive can keep up with the generation of data. Generally this means compression, and a database which works acceptably on large cheap hard drives or even tape libraries. Query speed and features are not a concern. Interestingly, this is the one type of database for which there are no well-known packaged open source solutions.
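As a rough sketch of what "minimize storage cost and keep up with the data" can look like in practice, here is a minimal nightly archiving job in Python. The directory paths, file naming, and the choice of gzip compression are assumptions made for the example, not features of any particular product.

```python
#!/usr/bin/env python3
"""Minimal nightly archiver: compress yesterday's raw files onto cheap storage.

An illustrative sketch only; the paths and the gzip approach are assumptions.
"""
import gzip
import shutil
from pathlib import Path

INCOMING = Path("/var/spool/calldata/incoming")   # hypothetical drop directory
ARCHIVE = Path("/mnt/cheap-storage/calldata")     # hypothetical large, slow volume

def archive_file(src: Path) -> None:
    dest = ARCHIVE / (src.name + ".gz")
    with src.open("rb") as fin, gzip.open(dest, "wb", compresslevel=9) as fout:
        shutil.copyfileobj(fin, fout)   # stream-compress without loading into RAM
    src.unlink()                        # drop the original once safely archived

def main() -> None:
    ARCHIVE.mkdir(parents=True, exist_ok=True)
    for src in sorted(INCOMING.glob("*.csv")):
        archive_file(src)

if __name__ == "__main__":
    main()
```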
"I'm accumulating gigabytes of data a day and I know
there's valuable information in there, but I don't know what it is."
This is probably the most common type of data warehouse; most businesses and web sites generate large amounts of data as a side effect of operation. However, most don't have any clear idea how to use this data; they just know that they want to use it somehow. More importantly, the structure and meaning of a lot of the data may not be completely known; the data may be full of documents, unknown fields, and undefined categories. This is often referred to as "semi-structured" data. Data sizes are generally in the terabytes to petabytes.
Web clickstream analytics is probably the classic example of this kind of
application. Generally data comes in as a mixture of structured data and
message text from the web server logs and cookies. Companies accumulate this data because they can, and then gradually build up sets of queries and reports to try to get useful trending data out of the database.
Desired qualities in a data mining solution are the ability to perform CPU and I/O intensive searches, sorts, and computations as rapidly and efficiently as possible. Parallelism, either over multiple processors or multiple servers, is highly desirable. As a secondary concern, data mining databases often have to accept data at very high rates, as much as gigabytes per minute.
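To make the parallelism requirement concrete, here is a toy scatter/gather aggregation that spreads a CPU-intensive counting job over the local cores with Python's multiprocessing module; an MPP or MapReduce system applies the same pattern across whole servers rather than cores. The log directory and layout (URL in the seventh whitespace-separated field) are invented for the example.

```python
"""Toy scatter/gather aggregation over web logs, one worker per CPU core.

The log location and layout are assumptions made purely for illustration.
"""
from collections import Counter
from multiprocessing import Pool
from pathlib import Path

def count_urls(logfile: str) -> Counter:
    """Map step: count hits per URL within a single log file."""
    counts = Counter()
    with open(logfile, encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.split()
            if len(fields) > 6:
                counts[fields[6]] += 1
    return counts

def main() -> None:
    logfiles = [str(p) for p in Path("/var/log/webfarm").glob("*.log")]
    with Pool() as pool:                       # one worker per core by default
        partials = pool.map(count_urls, logfiles)
    total = Counter()
    for partial in partials:                   # gather step: merge partial counts
        total.update(partial)
    for url, hits in total.most_common(20):
        print(f"{hits:10d}  {url}")

if __name__ == "__main__":
    main()
```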
"I have large volumes of highly structured data which I want to use to produce visualizations in order to support business decisions."
Businesses also often generate data they understand very well: sales records, customer accounts, and survey data. They want to use this data to generate charts, graphs, and other pretty pictures which can be used strategically. Quite a few different terms are used for this type of data system, including analytics, business intelligence (BI), decision support (DSS), and online analytical processing (OLAP).
I've implemented a couple of these types of systems using the data from a
Point Of Sale (POS) system for retail chains. POS data consists pretty
much entirely of numeric values, inventory IDs, and category trees, so it creates nice roll-ups by category, time, and geography. Mining is not required because the data is already very well understood, and changes to the categorization scheme are infrequent.
A great deal of the functionality of analytics systems resides in the analytics middleware tools. Data solutions for analytics are all about aggregation of large amounts of data. Support for various kinds of advanced analytics, such as "cubes" (explained below), is useful, as are data compression and advanced indexing. Data usually comes into these systems via a nightly batch process, so fast response times on writes are not that important.
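For a feel of what these systems spend their time doing, here is a toy roll-up in plain Python that aggregates point-of-sale rows by geography, time, and category at several levels of detail at once. The column names and sample rows are invented; a real warehouse does this in the database or OLAP layer over billions of rows.

```python
"""Toy roll-up of point-of-sale rows by (region, month, category).

The row layout and sample data are invented for illustration only.
"""
from collections import defaultdict

sales = [
    # (region, month,     category,    amount)
    ("west",  "2011-05", "beverages", 12.50),
    ("west",  "2011-05", "produce",    8.00),
    ("east",  "2011-05", "beverages",  4.25),
    ("east",  "2011-06", "beverages",  7.75),
]

rollup = defaultdict(float)
for region, month, category, amount in sales:
    rollup[(region, month, category)] += amount   # finest grain
    rollup[(region, month, "*")] += amount        # all categories, per region/month
    rollup[("*", month, "*")] += amount           # company-wide, per month

for key in sorted(rollup):
    print(key, round(rollup[key], 2))
```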
Five types of database systems
In the current market, these three fundamental questions of data warehousing are answered by five major types of database systems, which span decades of software development. Of course, many real-life big database systems are actually hybrids of more than one of the five types below, but I will list examples by their primary categories.
Standard relational databases
If you only have a dozen to a few hundred gigabytes of data, standard
mainstream relational databases are still probably the way to go. Whether you choose PostgreSQL, MySQL, Oracle, or SQL Server, all of them bring significant maturity, flexibility, and a large number of third-party and vendor tools. Perhaps more importantly, technical staff will already be familiar with them.
I helped a small retail chain deploy an analytics data warehouse for their inventory management system. Originally we were talking about doing it on a proprietary big database system, but once we did some tests we discovered that the maximum size of the data warehouse was going to be 350GB. Given that, we decided to keep them on mainstream open source PostgreSQL and save them some money and time.
Standard relational databases do not excel at any of the tasks of archiving, data mining, or analytics. However, they can do all of those tasks. So if your data warehousing problem is relatively small or not response-time-critical, then they are the way to go.
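If your warehouse does fit on a single ordinary server, the reporting side can stay equally ordinary. The sketch below runs a monthly roll-up against PostgreSQL using the psycopg2 driver; the connection string, table, and column names are hypothetical.

```python
"""Run a simple monthly roll-up against a single-server PostgreSQL warehouse.

Requires the psycopg2 driver; the DSN, table, and columns are hypothetical.
"""
import psycopg2

QUERY = """
    SELECT date_trunc('month', sold_at) AS month,
           category,
           sum(amount) AS total
    FROM   sales
    GROUP  BY 1, 2
    ORDER  BY 1, 2
"""

def main() -> None:
    conn = psycopg2.connect("dbname=warehouse user=report")
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            for month, category, total in cur.fetchall():
                print(month, category, total)
    finally:
        conn.close()

if __name__ == "__main__":
    main()
```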
MPP databases
This is the oldest type of database designed for data warehousing, dating back some 20 years. MPP stands for "massively parallel processing", which essentially means a relational database where a single query can be executed across multiple processors on multiple machines or motherboards. Database administrators love this kind of database because, with a few limitations, you can treat it like a really big, really fast relational database server. MPP databases include Teradata, Netezza, Greenplum, and the data warehousing edition of DB2.
When I worked for Greenplum, I set up multiple "clickstream analytics"
databases, where we were processing large amounts of web log data for marketing companies. We had no way of knowing what we were going to see in the logs, or even the page structure of the sites. We had to do a lot of CPU-intensive processing: aggregation over parsed text, running custom database functions, and building materialized views, for which a 16-node Greenplum database was quite fast.
MPP databases are good for both data mining and analytics. Some of them — particularly Greenplum — are also hybridizing other types of databases in this list. However, to date all of the production-quality MPP databases are proprietary, and generally very expensive for any really large database.
Column-store databases
Invented in 1999, column
store (or C-store) databases work by changing the basic storage model
used for relational, or "row-based", databases. In a row-based database,
data is stored in contiguous rows of attributes, and columns are related
through table metadata. A column-store database turns this model 90
degrees, storing columns of attributes together and relating the rows only
through metadata. This permits quite a few optimizations, including
various forms of compression and very fast aggregates.
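The difference in storage model is easy to show in miniature. The sketch below holds the same tiny table both row-wise and column-wise in plain Python, computes an aggregate that touches only one column, and applies a trivial run-length encoding to suggest why repetitive columns compress so well. The data is invented, and real column stores are far more sophisticated.

```python
"""Row store vs. column store in miniature, with a toy run-length encoding.

The table contents are invented; real column stores add dictionaries,
vectorized execution, and much smarter compression.
"""
from itertools import groupby

# Row-oriented: each row's attributes are stored contiguously.
rows = [
    ("2011-05-01", "beverages", 12.50),
    ("2011-05-01", "beverages",  4.25),
    ("2011-05-01", "produce",    8.00),
    ("2011-05-02", "beverages",  7.75),
]

# Column-oriented: each attribute is stored contiguously instead.
columns = {
    "sold_at":  [r[0] for r in rows],
    "category": [r[1] for r in rows],
    "amount":   [r[2] for r in rows],
}

# An aggregate touches only the column it needs...
total = sum(columns["amount"])

# ...and repetitive columns compress trivially, e.g. with run-length encoding.
rle = [(value, len(list(run))) for value, run in groupby(columns["category"])]

print("total amount:", total)
print("category column, run-length encoded:", rle)
```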
Current column stores include Vertica, Paraccel, Infobright, LucidDB, and MonetDB. While Vertica is probably the leading column-store database, the latter three are open source. Also, several databases of other types, including Aster Data and Greenplum, have been adding column stores as an option.
One of our clients was creating top-level radial charts for a few terabytes of hospital performance data. Since all of this data consisted of numeric values, ratings, or healthcare categories, Vertica turned out to be a very good solution which returned top-level summaries in a fraction of the time of a standard relational database.
Column stores are really suitable only for analytics, because all of the data must be well understood and highly structured to be stored in compressed columns. For that matter, C-stores are far more efficient with data which can be reduced to numbers and small lists of categories. Their main drawback is that they are slow to update or import data, and single-row updates are all but impossible.
MapReduce frameworks
The next innovation in data warehousing was popularized by Google less than a decade ago: MapReduce frameworks. MapReduce is really an algorithm which, when accompanied by clustering tools, allows you to take a single request and divide it into small portions to be executed across a large array of servers. When combined with some form of clustered or hash-partitioned storage, MapReduce allows users to perform large, long-running requests across tens to hundreds of nodes. Hadoop is the overwhelmingly dominant MapReduce framework. Current MapReduce-based databases include Hadoop with HBase, Hadapt, Aster Data, and CouchDB.
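To give a feel for the programming model, here is a sketch in the style of Hadoop Streaming, where the map and reduce steps are small scripts that read standard input and emit tab-separated key/value pairs while the framework handles splitting the input, sorting by key, and scheduling across nodes. It reuses the invented log format from the earlier multiprocessing example, and the hadoop invocation shown in the comment is only indicative.

```python
#!/usr/bin/env python3
"""Hadoop Streaming style mapper/reducer in one file: count hits per URL.

Run the map step with '--map' and the reduce step with '--reduce'; Hadoop
Streaming sorts the mapper output by key between the two steps. The log
layout (URL in the 7th whitespace-separated field) is an assumption.

A typical (hypothetical) invocation:
  hadoop jar hadoop-streaming.jar \
      -input /logs -output /hits \
      -mapper 'hitcount.py --map' -reducer 'hitcount.py --reduce' \
      -file hitcount.py
"""
import sys

def do_map() -> None:
    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 6:
            print(f"{fields[6]}\t1")          # emit: url TAB 1

def do_reduce() -> None:
    current, count = None, 0
    for line in sys.stdin:                    # input arrives sorted by key
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = key, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    do_reduce() if "--reduce" in sys.argv else do_map()
```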
On one project the client needed to be able to run requests across 30TB of mixed JSON and binary data. Because the search routines across the binary data were very processor-intensive, they put this on HBase and used Hadoop to run a lot of the processing routines, storing the query results in PostgreSQL for easy browsing later.
MapReduce is, in many ways, an open source, affordable alternative to MPP databases, and is primarily suitable for data mining. It also scales to larger numbers of nodes. However, MapReduce queries are a lot less efficient than MPP ones due to their generic nature — and are also a lot harder to write. This is changing thanks to tools like Hive and Pig which let users write MapReduce queries in a SQL-like syntax. Also, MapReduce databases are a lot younger than the preceding three types, making them less reliable and comparatively poorly documented.
The "new kid on the block" for data warehousing is enterprise search. This is so new that it really only consists of two open source products, both of them descendants of the Apache Lucene project: Solr and Elastic Search (ES). Enterprise search consists of doing multi-server partitioned indexing across large amounts of semi-structured data, in the form of "documents." Both also support "facets," which are materialized search indexes, allowing users to rapidly count and search documents by category, values, ranges, and even complex search expressions. Enterprise search also often gives "approximate" answers, which can be a feature or a defect, depending on your design goals.
Enterprise search is useful in some surprising places. We have one client who is using it to allow their clients to produce nuanced aggregate statistics on a very large body of legal documents. Putting this in Solr allowed the client to skip a lot of the data processing they needed to do to get it into other kinds of databases, while still giving them very fast search results. Particularly, Solr's precomputed counts in indexes allowed returning counts of documents much faster than in a relational database.
Enterprise search serves a subset of both data mining and analytics, making it broadly useful. Its biggest value comes when the data to be searched is already in HTML, XML, or JSON format and thus does not need to be converted or transformed before indexing. However, it's the youngest of the database types here and both products still have a lot of reliability issues and surprising limitations. Also, database requests are still tightly bound to the "search" model, which can make it difficult to use for very different use-cases.
Tools
As part of any data warehousing project, you will also need a variety of other tools in order to take your data from its source to a finished report or interface. While I cannot go into them in detail, here are the types of tools you need to know about:
Extract Transform Load (ETL) and Data Integration tools: these tools handle getting data from its original source into the final database format. The open source ETL leaders are Talend and KETTLE, and there are many proprietary tools, such as Informatica. In modern infrastructures, open source queueing platforms like ActiveMQ and RabbitMQ, along with custom code, are replacing formal ETL tools for many applications; a minimal hand-rolled example of the pattern appears after this list.
Data Mining and Data Analysis tools: tools like Weka, SAS, and various programs in the R language provide advanced tools for making sense out of large volumes of unsorted data. They help you find patterns in your data through statistical analysis and machine learning algorithms. In this domain the open source tools (Weka and R) lead the market, and proprietary tools are primarily for legacy use.
Reporting Tools: since you need to actually present your data, you'll need
reporting tools like BIRT, JasperReports,
or proprietary platforms like Business
Objects or MicroStrategy. These tools provide simple visualizations of your data, usually in the form of charts and graphs, which may be interactive. Recently the two leading open source options have caught up with proprietary competitors in ease of use, but it will take them a while to surpass them for features.
Online Analytical Processing (OLAP): a deceptive name which actually has a lot to do with providing a navigation-based interface for exploring your data, using "cubes." OLAP tools like Mondrian, Cognos, and Microsoft Analysis Services create a multi-dimensional spatial map of data which allows users to see different parts of the data by moving around within it. This is one area where open source tools are really lagging: open source databases have weak OLAP support compared to Oracle and SQL Server, and Mondrian is the only open source OLAP middleware.
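Returning to the ETL entry above: when a full tool like Talend is more than you need, the pattern often ends up as a small custom script. Here is a minimal extract/transform/load sketch in Python against PostgreSQL; the source file layout, connection string, and target table are hypothetical.

```python
"""Minimal hand-rolled ETL job: extract from CSV, clean up, load into PostgreSQL.

Requires psycopg2; the file layout, DSN, and target table are hypothetical.
"""
import csv
from datetime import datetime

import psycopg2

def extract(path):
    """Extract: yield raw rows from the source CSV export."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)

def transform(raw):
    """Transform: normalize types and skip rows that can't be parsed."""
    for row in raw:
        try:
            yield (
                datetime.strptime(row["sold_at"], "%Y-%m-%d %H:%M:%S"),
                row["category"].strip().lower(),
                float(row["amount"]),
            )
        except (KeyError, ValueError):
            continue   # a real job would log and quarantine bad rows

def load(rows):
    """Load: bulk-insert the cleaned rows into the warehouse table."""
    conn = psycopg2.connect("dbname=warehouse user=etl")
    try:
        with conn, conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO sales (sold_at, category, amount) VALUES (%s, %s, %s)",
                list(rows),
            )
    finally:
        conn.close()

if __name__ == "__main__":
    load(transform(extract("/var/spool/pos/daily_sales.csv")))
```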
I also need to mention Pentaho, which is a kind of all-in-one open source platform which glues together various open source ETL, data mining, and reporting tools.
Overall, there are open source tools for all levels of a data warehousing stack, but these tools often lack maturity or features in comparison to their proprietary competitors. However, given that most of the cutting-edge development in the analytics and data mining space is open source today, it's likely that the balance will tip towards open source over the next three to five years.
You should now be better acquainted with the world of data warehousing, especially from the database perspective. While we couldn't cover topics in detail or mention every project or product, at least you know where to start searching, and that there are many tools out there for every niche in the data warehousing world.
You can also see this same material presented in video format, from the recent Open Source Bridge conference in Portland, Oregon.