Thanks to advances in storage, processing power, and server affordability,
and above all to the rapidly increasing number of software applications and
monitoring tools businesses run these days, everyone seems to be
drowning in data. Many system administrators, application developers, and
junior database administrators are finding themselves doing "big data"
— ready or not — with no familiarity with the terminology,
ideas, or tools of data warehousing. Given that the field has been
around for 40 years, there's quite a bit to know. This article
will serve as a starting point for those wanting to learn the basics of
data warehousing.
What is data warehousing?
Data warehousing isn't quite the same thing as "big data"; rather, it's a subset. Big data also includes systems which need to process millions of small requests per second across a large number of connections, such as Amazon's Dynamo database. Data warehousing generally refers to systems which accumulate data over large periods of time, and need to process a relatively small number of very large data requests.
Data warehousing really answers one of three different questions for organizations: archiving, data mining, or analytics. Sometimes it answers a combination of the three.
"I want to store large amounts of data for long
periods of time and might never query any of it."
Thanks to HIPAA, Sarbanes-Oxley, the 8th Company Law Directive, and other
regulations, this is an increasingly common (or, at least, legally necessary) kind of data to retain. It can also be
called "WORN" data, for Write Once Read Never. Companies accumulate large volumes of data which they don't really want, but can't throw away and theoretically need to be able to access in some reasonable amount of time. Storage sizes can range from gigabytes to terabytes.
A good example of this kind of system is the archive for European cell phone call completion data I helped build for Comptel and Sun Microsystems. With a projected size of 75 terabytes of data per city, each database was expected to answer less than one information request per week. This allowed us to build it using a very inexpensive combination of PostgreSQL, Solaris, and filesystem compression.
If you're building an archive, your only requirements are to minimize storage cost and to make sure the archive can keep up with the generation of data. Generally this means compression, and a database which works acceptably on large cheap hard drives or even tape libraries. Query speed and features are not a concern. Interestingly, this is the one type of database for which there are no well-known packaged open source solutions.
"I'm accumulating gigabytes of data a day and I know
there's valuable information in there, but I don't know what it is."
This is probably the most common type of data warehouse; most businesses and web sites generate large amounts of data as a side effect of operation. However, most don't have any clear idea how to utilize this data; they just know that they want to utilize it somehow. More importantly, the structure and meaning of a lot of the data may not be completely known; the data may be full of documents, unknown fields, and undefined categories. Data sizes are generally in the terabytes to petabytes. This is often referred to as "semi-structured" data.
Web clickstream analytics is probably the classic example of this kind of
application. Generally data comes in as a mixture of structured data and
message text from the web server logs and cookies. Companies accumulate this data because they can, and then gradually build up sets of queries and reports to try to get useful trending data out of the database.
Desired qualities in a data mining solution are the ability to perform CPU and I/O intensive searches, sorts, and computations as rapidly and efficiently as possible. Parallelism, either over multiple processors or multiple servers, is highly desirable. As a secondary concern, data mining databases often have to accept data at very high rates, as much as gigabytes per minute.
"I have large volumes of highly structured data which I want to use to produce visualizations in order to support business decisions."
Businesses also often generate data they understand very well: sales records, customer accounts, and survey data. They want to use this data to generate charts, graphs, and other pretty pictures which can be used strategically. Quite a few different terms are used for this type of data system, including analytics, business intelligence (BI), decision support (DSS), and online analytical processing (OLAP).
I've implemented a couple of these types of systems using the data from a
Point Of Sale (POS) system for retail chains. POS data consists pretty
much entirely of numeric values, inventory IDs, and category trees, so it creates nice roll-ups by category, time, and geography. Mining is not required because the data is already very well understood, and changes to the categorization scheme are infrequent.
A great deal of the functionality of analytics systems resides in the analytics middleware tools. Data solutions for analytics are all about aggregation of large amounts of data. Support for various kinds of advanced analytics, such as "cubes" (explained below), is useful, as are data compression and advanced indexing. Data usually comes into these systems via a nightly batch process, so fast write response times are not that important.
Five types of database systems
In the current market, five major types of database systems answer these three fundamental questions of data warehousing. These systems span decades of software development. Of course, many real-life big database systems are actually hybrids of more than one of the five types below, but I will list examples by their primary categories.
Standard relational databases
If you only have a dozen to a few hundred gigabytes of data, standard
mainstream relational databases are still probably the way to go. PostgreSQL, MySQL, Oracle, and SQL Server all bring significant maturity, flexibility, and a large number of third-party and vendor tools. Perhaps more importantly, technical staff will already be familiar with them.
I helped a small retail chain deploy an analytics data warehouse for their inventory management system. Originally we were talking about doing it on a proprietary big database system, but once we did some tests we discovered that the maximum size of the data warehouse was going to be 350GB. Given that, we decided to keep them on mainstream open source PostgreSQL and save them some money and time.
Standard relational databases do not excel at any of the tasks of archiving, data mining, or analytics. However, they can do all of those tasks. So if your data warehousing problem is relatively small or not response-time-critical, then they are the way to go.
MPP databases
This is the oldest type of database designed for data warehousing, dating back some 20 years. MPP stands for "massively parallel processing", which essentially means a relational database where a single query can be executed across multiple processors on multiple machines or motherboards. Database administrators love this kind of database because, with a few limitations, you can treat it like a really big, really fast relational database server. MPP databases include Teradata, Netezza, Greenplum, and the data warehousing edition of DB2.
When I worked for Greenplum, I set up multiple "clickstream analytics"
databases, where we were processing large amounts of web log data for marketing companies. We had no way of knowing what we were going to see in the logs, or even the page structure of the sites. We had to do a lot of CPU-intensive processing: aggregation over parsed text, running custom database functions, and building materialized views, for which a 16-node Greenplum database was quite fast.
MPP databases are good for both data mining and analytics. Some of them — particularly Greenplum — are also hybridizing other types of databases in this list. However, to date all of the production-quality MPP databases are proprietary, and generally very expensive for any really large database.
Column-store databases
Invented in 1999, column
store (or C-store) databases work by changing the basic storage model
used for relational, or "row-based", databases. In a row-based database,
data is stored in contiguous rows of attributes, and columns are related
through table metadata. A column-store database turns this model 90
degrees, storing columns of attributes together and relating the rows only
through metadata. This permits quite a few optimizations, including
various forms of compression and very fast aggregates.
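To illustrate the difference, here is a minimal Python sketch (with made-up sales data, not modeled on any particular database engine) showing why column-oriented storage makes aggregates cheap:

```python
# Row store: each record's attributes are stored together.
row_store = [
    {"sale_id": 1, "region": "north", "amount": 100},
    {"sale_id": 2, "region": "south", "amount": 250},
    {"sale_id": 3, "region": "north", "amount": 175},
]

# Column store: each attribute's values are stored together;
# rows are related only by position (the "metadata" in the text above).
col_store = {
    "sale_id": [1, 2, 3],
    "region":  ["north", "south", "north"],
    "amount":  [100, 250, 175],
}

# A row store must scan every whole record to aggregate one column...
total_rows = sum(rec["amount"] for rec in row_store)

# ...while a column store reads one contiguous (and highly compressible)
# array, never touching the other columns.
total_cols = sum(col_store["amount"])

print(total_rows, total_cols)  # 525 525
```

The aggregate result is identical; the difference is how much data must be read from disk to compute it, which is where the speedups come from.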
Current column stores include Vertica, ParAccel, Infobright, LucidDB, and MonetDB. While Vertica is probably the leading column-store database, the latter three are open source. Also, several databases of other types, including Aster Data and Greenplum, have been adding column stores as an option.
One of our clients was creating top-level radial charts for a few terabytes of hospital performance data. Since all of this data was numerics, ratings, or healthcare categories, Vertica turned out to be a very good solution which returned top-level summaries in a fraction of the time of a standard relational database.
Column stores are really suitable only for analytics, because all of the data must be well understood and highly structured to be stored in compressed columns. For that matter, C-stores are far more efficient with data which can be reduced to numbers and small lists of categories. Their main drawback is that they are slow to update or import data, and single-row updates are all but impossible.
The next innovation in data warehousing was popularized by Google less than a decade ago: MapReduce frameworks. MapReduce is really an algorithm which, when accompanied by clustering tools, allows you to take a single request and divide it into small portions to be executed across a large array of servers. When combined with some form of clustered or hash-partitioned storage, MapReduce allows users to perform large, long-running requests across tens to hundreds of nodes. Hadoop is the overwhelmingly dominant MapReduce framework. Current MapReduce-based databases include Hadoop with Hbase, Hadapt, Aster Data, and CouchDB.
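The core idea can be sketched in a few lines of Python (a toy single-process word count; a real framework distributes the map, shuffle, and reduce phases across many nodes):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit (key, value) pairs from one input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all values by key. In a cluster this is done by
    # the framework, typically via hash partitioning across nodes.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a final result.
    return key, sum(values)

documents = ["big data big hype", "big warehouse"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'big': 3, 'data': 1, 'hype': 1, 'warehouse': 1}
```

Because each map call and each reduce call is independent, the framework can run them anywhere in the cluster, which is what gives MapReduce its scalability on long-running requests.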
On one project the client needed to be able to run requests across 30TB of mixed JSON and binary data. Because the search routines across the binary data were very processor-intensive, they put this on HBase and used Hadoop to run a lot of the processing routines, storing the query results in PostgreSQL for easy browsing later.
MapReduce is, in many ways, an open source, affordable alternative to MPP databases, and is primarily suitable for data mining. It also scales to larger numbers of nodes. However, MapReduce queries are a lot less efficient than MPP ones due to their generic nature — and are also a lot harder to write. This is changing thanks to tools like Hive and Pig which let users write MapReduce queries in a SQL-like syntax. Also, MapReduce databases are a lot younger than the preceding three types, making them less reliable and comparatively poorly documented.
Enterprise search
The "new kid on the block" for data warehousing is enterprise search. This is so new that it really only consists of two open source products, both of them descendants of the Apache Lucene project: Solr and ElasticSearch (ES). Enterprise search consists of doing multi-server partitioned indexing across large amounts of semi-structured data, in the form of "documents." Both also support "facets," which are materialized search indexes, allowing users to rapidly count and search documents by category, values, ranges, and even complex search expressions. Enterprise search also often gives "approximate" answers, which can be a feature or a defect, depending on your design goals.
Enterprise search is useful in some surprising places. We have one client who is using it to allow their clients to produce nuanced aggregate statistics on a very large body of legal documents. Putting this in Solr allowed the client to skip a lot of the data processing they needed to do to get it into other kinds of databases, while still giving them very fast search results. Particularly, Solr's precomputed counts in indexes allowed returning counts of documents much faster than in a relational database.
Enterprise search serves a subset of both data mining and analytics, making it broadly useful. Its biggest value comes when the data to be searched is already in HTML, XML, or JSON format and thus does not need to be converted or transformed before indexing. However, it's the youngest of the database types here and both products still have a lot of reliability issues and surprising limitations. Also, database requests are still tightly bound to the "search" model, which can make it difficult to use for very different use-cases.
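Faceting, the key feature mentioned above, amounts to keeping per-category document counts alongside the index; a toy Python version (with hypothetical legal documents, and nothing like Solr's actual data structures) looks like:

```python
from collections import Counter

documents = [
    {"text": "contract dispute", "court": "district",  "year": 2009},
    {"text": "patent appeal",    "court": "appellate", "year": 2010},
    {"text": "contract appeal",  "court": "appellate", "year": 2010},
]

def facet_counts(docs, field):
    # Count matching documents per value of one field. A search engine
    # precomputes these counts in the index so that faceted navigation
    # never needs to rescan the documents themselves.
    return Counter(doc[field] for doc in docs)

# Run a term query, then facet the matches on the "court" field.
matches = [d for d in documents if "appeal" in d["text"]]
print(facet_counts(matches, "court"))   # Counter({'appellate': 2})
print(facet_counts(documents, "year"))  # Counter({2010: 2, 2009: 1})
```

This is why the legal-documents client described above could get category counts so much faster than from a relational database: the counts come from the index, not from a table scan.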
Data warehousing tools
As part of any data warehousing project, you will also need a variety of other tools to take your data from its source to a finished report or interface. While I cannot go into them in detail, here are the types of tools you need to know about:
Extract Transform Load (ETL) and Data Integration tools: these tools handle getting data from its original source into the final database format. Open source ETL tool leaders are Talend and KETTLE, and there are many proprietary tools such as Informatica. In modern infrastructures, open source queueing platforms like ActiveMQ, RabbitMQ, and custom code are replacing formal ETL tools for many applications.
Data Mining and Data Analysis tools: tools like Weka, SAS, and various programs in the R language provide advanced tools for making sense out of large volumes of unsorted data. They help you find patterns in your data through statistical analysis and machine learning algorithms. In this domain the open source tools (Weka and R) lead the market, and proprietary tools are primarily for legacy use.
Reporting Tools: since you need to actually present your data, you'll need
reporting tools like BIRT, JasperReports,
or proprietary platforms like Business
Objects or MicroStrategy. These tools provide simple visualizations of your data, usually in the form of charts and graphs, which may be interactive. Recently the two leading open source options have caught up with their proprietary competitors in ease of use, but it will take them a while longer to match the proprietary tools' features.
Online Analytical Processing (OLAP): a deceptive name which actually has a lot to do with providing a navigation-based interface for exploring your data, using "cubes." OLAP tools like Mondrian, Cognos, and Microsoft Analysis Services create a multi-dimensional spatial map of data which allows users to see different parts of the data by moving around within it. This is one area where open source tools are really lagging; open source databases have weak OLAP support compared to Oracle and SQL Server, and Mondrian is the only open source OLAP middleware.
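The "cube" behind these tools is essentially a pre-aggregation of a fact table along every combination of its dimensions. A minimal Python sketch (with hypothetical sales data; real OLAP engines use far smarter storage than a flat dictionary) of building and querying such a cube:

```python
from collections import defaultdict

sales = [
    # (region, quarter, product, amount) -- a tiny fact table
    ("west", "Q1", "widget", 100),
    ("west", "Q2", "widget", 150),
    ("east", "Q1", "gadget", 200),
]

# Build the cube: a total for every subset of the three dimensions.
# '*' marks a dimension that has been rolled up (aggregated away).
cube = defaultdict(int)
dims = 3
for region, quarter, product, amount in sales:
    values = (region, quarter, product)
    for mask in range(2 ** dims):
        key = tuple(values[i] if mask & (1 << i) else "*"
                    for i in range(dims))
        cube[key] += amount

# Navigating the data is now a matter of cheap dictionary lookups:
print(cube[("west", "*", "*")])  # 250: all west-region sales
print(cube[("*", "Q1", "*")])    # 300: all Q1 sales
print(cube[("*", "*", "*")])     # 450: grand total
```

Drilling down from the grand total to a region, then to a quarter within that region, is just a walk from more-aggregated keys to less-aggregated ones, which is the navigation model OLAP interfaces expose.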
I also need to mention Pentaho, which is a kind of all-in-one open source platform which glues together various open source ETL, data mining, and reporting tools.
Overall, there are open source tools for all levels of a data warehousing stack, but these tools often lack maturity or features in comparison to their proprietary competitors. However, given that most of the cutting-edge development in the analytics and data mining space is open source today, it's likely that the balance will tip towards open source over the next three to five years.
You should now be better acquainted with the world of data warehousing, especially from the database perspective. While we couldn't cover topics in detail or mention every project or product, at least you know where to start searching, and that there are many tools out there for every niche in the data warehousing world.
You can also see this same material presented in video format, from the recent Open Source Bridge conference in Portland Oregon.
Last week's Kernel Page included an article on
the top contributors to the 3.0 kernel
, an extended version of our
traditional look at who participated in each kernel development cycle.
Since this article is traditional, it tends not to draw a lot of
attention. This time was different: quite a few publications have picked
up on the fact that Microsoft was one of the top contributors of changesets
by virtue of a long series of cleanups to its "hv" driver in the staging
tree. Those sites, seemingly, got a lot more mileage out of those results
than LWN did, an amusing outcome which can be expected occasionally with
subscriber-only content. That said, this outcome is a bit dismaying for
your editor.
Some background: the hv driver helps Linux to run as a virtualized guest
under Microsoft's Hyper-V hypervisor. It is useful code which, with luck, will soon move out of the
staging tree and into the mainline kernel proper. After a period of
relative neglect, developers at Microsoft have started cleaning up hv with
a vengeance - 366 hv patches were merged for the 3.0 development cycle.
This work has clear value; it is aimed at getting this code ready to
graduate from staging; it is worth having.
That said, let's look at the actual patches. 47 of them simply
move functions around to eliminate the need for forward declarations; 39 of
them rename functions and variables; 135 of them take the form "get rid of
X" for some value of (usually unused) X. Clearly this 15,000-line driver
needed a lot of cleaning, and it's good that people are doing the work.
But it also seems somewhat uncontroversial to say that this particular body
of work does not constitute one of the more significant contributions to
the 3.0 kernel.
Part of the problem, beyond any doubt, is the creation of lists of top
changeset contributors in the first place. The number of changes is an
extremely poor metric if one is interested in how much real, interesting
work was contributed to the kernel. Some changes are clearly more
interesting than others. Highlighting changeset counts may have ill effects
beyond misleading readers; if the number of changesets matters, developers
will have an incentive to bloat their counts through excessive splitting of
changes - an activity which, some allege, has been going on for a while.
LWN does post another type of statistic - the number of lines changed. As
with changeset counts, this number does contain a modicum of information.
But as a metric for the value of kernel contributions, it is arguably even
worse than changeset counts. Picking a favorite Edsger Dijkstra quote is a
challenging task, but this one would be a contender:
If we wish to count lines of code, we should not regard them as
"lines produced" but as "lines spent".
Just because a patch changes (or adds) a lot of code does not mean that
there is a lot of value to be found therein.
Given these problems, one might be tempted to just stop producing these
statistics at all. Yet these metrics clearly have enough value to be
interesting. When LWN first started posting these numbers, your editor was
approached at conferences by representatives from two large companies who
wanted to discuss discrepancies between those numbers and the ones they had
been generating internally. We are routinely contacted by companies wanting to
be sure that all of their contributions are counted properly. Developers
have reported receiving job offers as a result of their appearance in the
lists of top contributors. Changeset counts are also used to generate the
initial list of nominees to the Kernel Summit. For better or for worse,
people want to know who the most significant contributors to the kernel
are.
So it would be good to find some kind of metric which yields that
information in a more useful way than a simple count of changesets or lines
of code. People who understand the code can usually look at a patch and
come to some sort of internal assessment - though your editor might be
disqualified by virtue of having once suggested that merging devfs would be
a good idea. But the reason why even that flawed judgment is not used in LWN's
lists is simple: when a typical development cycle brings in 10,000
changesets, the manual evaluation process simply does not scale.
So we would like to have a metric which would try, in an automated fashion,
to come up with an improved estimate of the value of each patch. That does
not sound like an easy task. One could throw out some ideas for heuristics
as a place to start; here are a few:
- Changes to core code (in the kernel, mm, and
fs directories, say) affect more users and are usually more
heavily reviewed; they should probably be worth more.
- Bug fixes have value. A heuristic could try to look to see if the
changelog contains a bug ID, whether the patch appears in a stable
update, or whether it is a revert of a previous change.
- Patches that shuffle code but add no functional change generally have
  relatively low value. Patches adding documentation, instead, are
  worth rather more.
- Patches that remove code are generally good. A patch that adds code
to a common directory and removes it from multiple other locations may
be even better.
- Patches adding significant code which appears to be cut-and-pasted
  from elsewhere may have negative value.
- Changes merged late in the development cycle may not have a high
value, but, if the development process is working as it should be,
they should be worth something above the minimum.
- Changes merged directly by Linus presumably have some quality which
caught his attention.
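To make the idea concrete, such heuristics could be sketched as a scoring function like the following (the weights and the patch fields are entirely hypothetical, not anything LWN actually computes):

```python
def score_patch(patch):
    """Assign a rough value score to one patch, per the heuristics above."""
    score = 1.0  # every merged patch is worth something

    # Core-code changes affect more users and are more heavily reviewed.
    if any(patch["path"].startswith(d) for d in ("kernel/", "mm/", "fs/")):
        score += 2.0

    # Bug fixes have value: look for a bug ID, a stable backport,
    # or a revert of a previous change.
    if patch["has_bug_id"] or patch["in_stable"] or patch["is_revert"]:
        score += 1.5

    # Patches that only remove code are generally good.
    if patch["lines_added"] == 0 and patch["lines_removed"] > 0:
        score += 1.0

    # Late-cycle merges should be worth something above the minimum.
    if patch["merged_after_rc1"]:
        score += 0.5
    return score

example = {"path": "mm/slab.c", "has_bug_id": True, "in_stable": False,
           "is_revert": False, "lines_added": 0, "lines_removed": 40,
           "merged_after_rc1": True}
print(score_patch(example))  # 6.0
```

Even this toy version shows the difficulty: every weight is a judgment call, and developers whose work lands outside the favored directories would have grounds to complain.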
Once a scoring system was in place, one could, over time, try to develop a
series of rules like the above in an attempt to better judge the true value
of a developer's contribution.
That said, any such metric will certainly be seen as unfair by at least
some developers - and rightly so, since it will undoubtedly be
unfair. This problem has no solution that will be universally seen as
correct. So, while we may well play with some of these ideas, it seems
likely that we are stuck with changeset and lines-changed counts for the
indefinite future. These metrics, too, are unfair, but at least they are
unfair in an easily understood and objectively verifiable way.
The Samba project has started
discussions to alter its long-standing copyright policy. For years, the
project has accepted only code contributions where the copyright is held by
an individual rather than by a corporation. Under the proposed change, the
project would accept corporate-copyrighted code, but would still prefer
that its contributors submit work as individuals.
Jeremy Allison posed the question to the Samba community in a July 12th email to the samba-technical mailing list. He described the proposal as coming from himself and Andrew Tridgell, a solicitation for input from the broader Samba community before putting the issue to a vote by Samba team members themselves.
Samba's existing copyright policy is detailed on the project's site, along with a note that
contributors for whom a personal copyright is infeasible can also choose to
assign their copyright to the Software
Freedom Conservancy (SFC), which has Samba as one of its member
projects.
Because Samba is a clean-room reimplementation of proprietary software, the
project also requires that
developers who have signed Microsoft's CIFS license agreement not submit
code, and that contributors ensure that their patches do not infringe on
known patents. Although all free software projects and developers are well
aware of the
trouble fueled by software patent litigation, Samba is in the minority of
projects by asking its contributors to assert up front that their work is non-infringing.
Contributions, yesterday and today
Allison lists two reasons for the project's historical stance against allowing corporate-copyrighted contributions. First, the team simply preferred that GPL enforcement decisions be made by individuals, not by corporations. "Enforcement decisions" would typically include how and when to contact suspected violators, who would speak on behalf of the project, and what course of action to pursue at each stage of the discussion.
The second reason deals specifically with the consequences of a
successfully-resolved license violation — meaning one in which the
violator has come back into compliance. Under GPLv2 (which Samba was
available under prior to the 3.2 release in July of 2008), a violator
automatically loses all rights to modify or distribute the software under
the license (section 4), and those rights can only be reinstated by the
copyright holders' collective decision. Individuals, the Samba project
believes, act much more reasonably in these rights-reinstatement
circumstances than corporations sometimes do.
The termination-of-rights section in GPLv2 is commonly
overlooked (see section 5.2) by casual readers, and was one of the key
points in drafting GPLv3. Because the earlier license did not specifically
address how copyright holders go about reinstating the rights of former
violators, every project could enact its own policy, or be completely
arbitrary on a case-by-case basis. A high-profile example of the confusion
over this issue is Richard Stallman's public argument
that early KDE releases automatically lost their redistribution rights to some
GPL-covered software by combining it with the QPL-licensed Qt toolkit, and
call for the affected copyright-holders to publicly reinstate the project's
rights. In response, KDE called
Stallman's interpretation of the clause "absurd" and refused
to request "forgiveness" as Stallman suggested.
The corresponding section in GPLv3 (section 8) attempts to standardize the situation and provide for a simple reinstatement procedure. A former violator's rights to redistribute the licensed software are reinstated if the violation is corrected and the copyright holders do not make further complaints for 60 days, or if the violator corrects its first violation within 30 days of the first notice. Furthermore, "provisional" permission to redistribute the software is granted as soon as the violation is fixed, pending the expiration of the 30- or 60-day window (whichever applies).
Due to that change, Allison said, the GPLv3 releases of Samba no longer benefited by excluding corporate-copyrighted patches outright. However, he did recommend that the project continue to encourage personal copyrights (or SFC copyright-assignment) over corporate contributions for most code, and continue to ban corporate copyrights for the project's library components or "code that might be moved into a library." Allison does not draw a bright line between the situations where personal copyright would still be required and where it would only be encouraged, but the intent seems to be to free up companies to contribute fixes and patches, rather than to open the door for entirely new features or large modules submitted under a corporate copyright.
The reason for the desire to retain the individual-only policy for more
significant changes is re-licensing (an issue which is also referred to on
the current policy page). Consent of the copyright holder is also required
to re-license the contribution. Although it is not a frequent occurrence,
the Samba project believes that getting assent for the re-license from
individuals is both simpler and faster than getting it from corporations
— which typically require decisions to be made by internal managerial
and legal teams.
Samba has re-licensed portions of its code in the past; notably, the tdb and talloc libraries were re-licensed under the LGPL. But for general patches, including "things like build fixes for specific platforms," Allison said the non-corporate policy can only delay or prevent potential contributors from sending in their work.
Both individual and corporate contributors participated in the subsequent list discussion. By and large the community either approved of the proposal or simply deferred to the Samba team's wishes, but there were a few concerns voiced over the split-policy nature of the draft policy.
Simo Sorce, the Samba team GPL compliance officer, speculated as to whether corporate-copyrighted contributions of new features would hinder the team's later ability to split new features into separate libraries (thus triggering the re-licensing problem). Corporate copyrights not only slow down the re-licensing process, he said, but potentially cause greater problems when the corporation in question shuts down or is sold.
Andrew Bartlett expressed
concern that having a more complex policy regarding corporate
copyrights would make it harder to communicate to companies about their
responsibilities, which could slow down development by forcing them to
adjust their engineers' employment agreements. Generally, engineers'
employment agreements (or the legal norms of their jurisdiction) dictate that all code produced while a salaried employee is a "work for hire," with the copyright held by the company. Even in those situations where the developer retains copyright, however, he or she must typically get approval from upstairs to contribute.
The select circle of companies who devote employee time to Samba
development must make a contractual exception in order to comply with
Samba's current policy (though judging by list traffic these companies have
been willing to do so because of their desire to contribute to Samba).
Bartlett's concern, then, is that subdividing the copyright policy to
distinguish between different types of code contributions would further
complicate the contracts of corporate-paid Samba contributors. Bartlett
also pointed out that the contribution guidelines need clarification
between code and non-code contributions, which are currently not treated
differently. Clarifying the situation around non-code contributions could
encourage companies to contribute documentation or help to the wiki, he
said.
A week has passed since the initial message to the list, and so far there are no adamant objections, so it appears likely that the new policy (perhaps with clarifications) will eventually go into effect.
In the end, the policy shift is a fairly minor change, and in effect
loosens some of the restrictions on corporate contributions. So it's not
too much of a surprise that there has been little opposition. It is also interesting to observe that the discussion came up so soon after the final release of the much-debated Harmony project, which set out to craft a suite of contributor agreements.
It does not appear that the Harmony project's templates or tools had an influence on the Samba team's decision or process. However, it also appears that the Harmony project's agreements leave several issues of importance to the Samba team unaddressed. Harmony encompasses two separate types of agreement: contributor license agreements (CLAs) and copyright assignment agreements (CAAs). Samba's copyright policy is essentially a CLA, albeit an informal one instead of a signed contract.
The Harmony CLA templates cover ensuring the originality of the submitted code, and deal at length with the project's right to re-license the work as a whole. They do not, however, deal with either the patent-non-infringement issue or the clean-room reimplementation guarantee that Samba requires of contributors. Regarding the re-licensing question, having that permission in place from the beginning might simplify Samba's occasional re-licensing work — although historically it is a rare enough occurrence that the project has never felt the need to address it up front. In his message to the list, Sorce did ask corporate contributors for feedback on the re-licensing process, including the need to get "promises about the kind of licenses we will consider in case changes are needed."
The Harmony CLAs also provide no framework for Samba to specify that only contributions under personal copyrights are permitted, nor to provide separate rules for minor patches, new functionality, or library code.
The next major release of Samba is version 3.6.0, which is currently in
its second release-candidate testing phase. Allison and the other members
of the team did not set a timeline for a new copyright policy to take
effect — if one is eventually adopted — but presumably it
would not be before the next development cycle begins, after a vote of the
Samba team.
Page editor: Jonathan Corbet