January 11, 2012
This article was contributed by Nathan Willis
The Apache Software Foundation (ASF) has declared its Hadoop software framework ready to be
called 1.0. Hadoop, a darling of the "Big Data" movement, is a framework
for writing distributed applications that process vast quantities of data
in parallel — where "vast" means petabyte-scale and larger, divided
up across thousands of nodes. In one sense, the 1.0
release is an arbitrary declaration by the Hadoop team that the core of
the framework (which has been in development for six years, and is in
widespread use) has reached enterprise-level stability suitable for
commercial adoption. But, coming from a top-level project at the ASF, the
"1.0" also
represents a commitment to long-term support from the community, and the
release includes notable improvements in security, database functionality,
and filesystem access.
A quick bit of background
Hadoop's core framework is an implementation of the MapReduce programming
paradigm. The MapReduce approach involves dividing a large data set
up among a cluster of compute nodes. The "map" step applies a single
(presumably simple) function to the entire data set; that function is executed in
parallel by each node on its own chunk. The "reduce" step then collects
the chunks of output generated by the nodes and applies another function
to combine them, as appropriate, into a result for the data set as a whole.
The canonical example is performing a word frequency count on a large set of documents. The set of documents is first divided up among the nodes, and the map function splits each document into individual words, returning the words as a list. The reduce function ingests all of the word lists produced by the mappers, and increments a separate counter for each word as it progresses.
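As a rough sketch of what those two functions can look like in practice, here is the word-count example expressed against Hadoop's Java MapReduce API (covered in more detail below); the class names are invented for the illustration, and a real application would add error handling:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // The mapper receives one line of text at a time and emits
        // a <word, 1> pair for every token it finds.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // The reducer receives all of the counts emitted for a given word
        // and sums them into a single total.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }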
In practice, a MapReduce application is also responsible for the potentially trickier steps of deciding how best to partition the input data set among the available nodes, how to sort (or otherwise prepare) the output returned by the mappers, and how to read and write the massive data sets between the nodes and storage. Hadoop supports multiple options for most of these tasks, including several job allocation algorithms, and several different storage back-ends (including read-only HTTP/HTTPS servers and Amazon S3). The MapReduce framework is also responsible for managing the communication between the master node and the child mapper and reducer nodes. The approach can be multi-tiered, so that a mapper node can subdivide its chunk of the data and split it up among child mapper nodes. MapReduce is capable of working with heterogeneous compute clusters, so particularly fast or multi-processor nodes can get assigned a larger portion of the input data by the framework's job coordinator.
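Many of those choices are expressed when the job itself is configured. A minimal driver for the word-count sketch above might look something like the following; the input and output paths are placeholders, and the combiner line simply reuses the reducer as a local pre-aggregation step on each mapper node:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Input and output are typically HDFS directories; other storage
            // back-ends can be selected with different InputFormat classes.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The framework handles the rest: splitting the input, scheduling map and reduce tasks across the cluster, and sorting the mapper output by key before it reaches the reducers.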
The concepts at the heart of MapReduce programming are not themselves
exotic; similar ideas are well-known from multi-threading features in
several existing languages. But MapReduce was popularized in a 2004
USENIX OSDI paper
[PDF] published by Google, which described how the search giant used
MapReduce across large clusters on huge data sets — Google famously
used its in-house MapReduce tools to rebuild its index of the entire web.
Google's MapReduce was designed to operate on <key,value> pairs (a
natural match for the string-centric computations of web search), and a second
paper [PDF] described the "Google FS" (GFS) distributed filesystem that the company developed to support its MapReduce clusters.
Google's MapReduce implementation was also provided to users of its
AppEngine service, but it is not free software. In 2004, Doug Cutting,
co-creator of the Apache Lucene and
Nutch search projects, began writing
an open source implementation of the MapReduce concept within Nutch; the
work was split out as Hadoop in 2006, after Cutting joined Yahoo.
Yahoo has remained one of Hadoop's chief
contributors and evangelists, and has been joined by other data-centric web
companies such as Facebook and eBay. Hadoop is designed to run on
commodity hardware, including heterogeneous nodes, and to scale up
rapidly.
Extras
In addition to its central MapReduce framework, Hadoop ships with a
number of infrastructure tools to support large MapReduce applications.
The most notable is HDFS, a distributed filesystem
designed to run on Hadoop clusters. In spite of the name, HDFS is not a
filesystem in the Linux kernel sense of the word; it is a node-based
storage system whose design mirrors that of a Hadoop cluster. A master
node called the NameNode keeps track of all file metadata, and coordinates
among the various DataNodes used for storage. Files are broken up into
blocks and replicated among the DataNodes, based on parameters that are
adjustable for optimum fault-tolerance or speed.
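Client code reaches HDFS through Hadoop's org.apache.hadoop.fs.FileSystem abstraction rather than through the kernel's VFS layer. The following sketch assumes a cluster whose NameNode address is configured in core-site.xml; the path and replication factor are made up for the example:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster settings (fs.default.name, etc.) from
            // the configuration files on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/user/example/hello.txt");

            // Write a small file; the NameNode decides which DataNodes
            // will hold the block replicas.
            FSDataOutputStream out = fs.create(path, true);
            out.write("hello, HDFS\n".getBytes("UTF-8"));
            out.close();

            // Per-file replication can be tuned for fault tolerance.
            fs.setReplication(path, (short) 3);

            // Read the file back.
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
            System.out.println(in.readLine());
            in.close();
        }
    }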
The new release adds several noteworthy features to the filesystem, starting with WebHDFS, a REST-like HTTP interface. This API exposes the complete filesystem interface over HTTP, via GET, PUT, POST, and DELETE, which makes it possible to manage an HDFS volume without writing custom Java or C client code. The filesystem can now also be protected against unauthorized access by requiring Kerberos authentication.
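In practice, a WebHDFS call is just an HTTP request aimed at the NameNode's web port. As a rough illustration (the host, port, and path below are placeholders, and the port can vary between installations), listing a directory looks something like this:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsList {
        public static void main(String[] args) throws Exception {
            // WebHDFS operations live under /webhdfs/v1/ and are selected
            // with an "op" query parameter.
            URL url = new URL("http://namenode.example.com:50070/webhdfs/v1/user/example?op=LISTSTATUS");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            // The response is a JSON document describing the directory contents.
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
        }
    }

Operations that write data use PUT or POST and are redirected to the DataNode that will actually store the blocks.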
The MapReduce core also picks up a Kerberos authentication option, naturally. The Kerberos work is part of a larger security-hardening effort undertaken to prepare Hadoop for 1.0. The other security changes include stricter permissions on files and directories, enabling access control lists (ACLs) for task resources, and ensuring that an application's task processes run as non-privileged users.
Hadoop itself is written in Java and provides a full Java API for MapReduce, but there are several interfaces designed to help developers code in other languages as well. The best known are Yahoo's Pig (a high-level scripting language), Facebook's Hive (which overlays a database-like structure and offers an SQL-like query interface), and Hadoop Streaming, which provides a text-based interface exposed through stdin and stdout. Using Hadoop Streaming (which, unlike Pig and Hive, is developed within the Hadoop project itself), developers can call executables written in any language as their mapper and reducer functions, routing data through them as they would with Unix pipes.
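Because the streaming interface is nothing more than line-oriented text on standard input and output, a mapper or reducer can be a tiny program in whatever language is convenient. Purely to keep the examples here in one language, the following is a word-count streaming mapper written as a stand-alone Java program; a short Python script or shell pipeline would do the same job:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Reads lines of input text on stdin and emits one "word<TAB>1" pair
    // per word on stdout, which is the key/value convention that Hadoop
    // Streaming expects.
    public class StreamingMapper {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        System.out.println(word + "\t1");
                    }
                }
            }
        }
    }

Such an executable is handed to the streaming jar with the -mapper and -reducer options (along with -input and -output paths), and Hadoop takes care of piping the data through it on each node.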
HDFS is not designed to serve as a relational database, optimized as it is for streaming read/write performance. But the popularity of projects like Hive shows that many Hadoop users are interested in some level of database-like functionality for their MapReduce problems. Google created the proprietary BigTable to add a database layer to GFS; Hadoop created HBase to offer similar functionality for HDFS. HBase does not offer many of the features that other "NoSQL" RDBMS-replacement products advertise (such as typed columns, secondary indexes, and advanced queries). It does, however, offer a table structure and record lookups, and implements "strongly consistent" reads and writes, sharding, and some optimization techniques such as block caches and Bloom filters. Under the hood, though, HBase stores its data in HDFS files.
HBase is officially part of the 1.0 release, and is now a fully-supported storage option for MapReduce jobs. Like HDFS, it is accessible using either Java or REST APIs. Neither HBase nor BigTable has ever matched the performance of a traditional RDBMS, but the Hadoop project does say that the 1.0 release includes performance enhancements, particularly for access to HDFS files stored on the local disk.
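For a sense of what HBase looks like from the Java side, here is a small sketch using the client API of roughly that vintage; the table name and column family are invented, and the table is assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Reads the cluster settings from hbase-site.xml on the classpath.
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "webpages");

            // Store one cell: row key, column family:qualifier, value.
            Put put = new Put(Bytes.toBytes("example.com/index.html"));
            put.add(Bytes.toBytes("content"), Bytes.toBytes("title"),
                    Bytes.toBytes("Example"));
            table.put(put);

            // Random-access read of the same row.
            Get get = new Get(Bytes.toBytes("example.com/index.html"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("content"), Bytes.toBytes("title"))));

            table.close();
        }
    }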
The Big Data user community has already widely embraced Hadoop, with
heavyweight service providers like IBM and Oracle offering Hadoop-based
products, in addition to the smaller, cloud-oriented service companies and
various start-ups run by Hadoop project members. In many ways,
the 1.0 milestone (which according to the release notes is based on the
project's 0.20-security branch) is recognition of the stability the project
has already achieved.
Consequently, Hadoop add-ons may be the focus of future news-making
developments for the project. Google has reportedly moved further away from
the original MapReduce itself in recent years, putting more work into the
GFS and BigTable layers of the stack. The various Hadoop service providers
and Big Data users (Yahoo and Facebook included) are similarly extending
the Hadoop core, with projects like Pig and Hive. The list of
Hadoop-derived projects includes a number of efforts to leverage Hadoop's
demonstrably stable core to take on wildly different classes of problem:
database applications, data collection, machine learning, and even
configuration management. As beneficial as an open source MapReduce
implementation is on its own merits, this ripple effect will influence a far
wider set of computing tasks in the future.