Hadoop rings in the new year with a 1.0 release
The Apache Software Foundation (ASF) has declared its Hadoop software framework ready to be called 1.0. Hadoop, a darling of the "Big Data" movement, is a framework for writing distributed applications that process vast quantities of data in parallel — where "vast" means petabyte-scale and larger, divided up across thousands of nodes. In one sense, the 1.0 release is an arbitrary declaration by the Hadoop team that the core of the framework (which has been in development for six years, and is in widespread use) has reached enterprise-level stability suitable for commercial adoption. But, coming from a top-level project at the ASF, the "1.0" also represents a commitment to long-term support from the community, and the release includes notable improvements in security, database functionality, and filesystem access.
A quick bit of background
Hadoop's core framework is an implementation of the MapReduce programming paradigm. The MapReduce approach involves dividing a large data set up among a cluster of compute nodes. The "map" step applies a single (presumably simple) function to the entire data set; that function is executed in parallel by each node on its own chunk. The "reduce" step then collects the chunks of output generated by the nodes and applies another function to combine them, as appropriate, into a result for the data set as a whole. The canonical example is performing a word frequency count on a large set of documents. The set of documents is first divided up among the nodes, and the map function splits each document into individual words, returning the words as a list. The reduce function ingests all of the word lists produced by the mappers, and increments a separate counter for each word as it progresses.
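For the curious, here is a sketch of how the word-count example maps onto Hadoop's Java MapReduce API. The class names are hypothetical and the code is only illustrative, but the Mapper and Reducer base classes and the Context-based output calls come from the standard org.apache.hadoop.mapreduce interfaces.

    // WordCountMapper.java -- emits <word, 1> for every word in its input split
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // one record per word occurrence
            }
        }
    }

    // WordCountReducer.java -- sums the counts emitted for each distinct word
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));  // <word, total>
        }
    }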
In practice, a MapReduce application is also responsible for the potentially trickier steps of deciding how best to partition the input data set among the available nodes, how to sort (or otherwise prepare) the output returned by the mappers, and how to read and write the massive data sets between the nodes and storage. Hadoop supports multiple options for most of these tasks, including several job allocation algorithms, and several different storage back-ends (including read-only HTTP/HTTPS servers and Amazon S3). The MapReduce framework is also responsible for managing the communication between the master node and the child mapper and reducer nodes. The approach can be multi-tiered, so that a mapper node can subdivide its chunk of the data and split it up among child mapper nodes. MapReduce is capable of working with heterogeneous compute clusters, so particularly fast or multi-processor nodes can get assigned a larger portion of the input data by the framework's job coordinator.
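Most of those choices show up in the job driver rather than in the map and reduce functions themselves. The driver below is a hypothetical companion to the word-count classes sketched above; the input and output paths are placeholders, and pointing them at a different filesystem URI (an S3 bucket, for instance) is how a different storage back-end gets selected.

    // WordCountDriver.java -- configures and submits the word-count job
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class); // optional pre-aggregation on the mapper side
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // The input format decides how the data set is split among mappers;
            // the paths are placeholders and could point at another supported
            // back-end (such as an s3:// URI) instead of HDFS.
            FileInputFormat.addInputPath(job, new Path("/data/books"));
            FileOutputFormat.setOutputPath(job, new Path("/data/wordcounts"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }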
The concepts at the heart of MapReduce programming are not themselves exotic; similar ideas are well known from the map and reduce (or fold) operations of functional programming languages and from data-parallel features in several existing languages. But MapReduce was popularized in a 2004 paper [PDF] that Google presented at the USENIX OSDI conference, which described how the search giant used MapReduce across large clusters on huge data sets — Google famously used its in-house MapReduce tools to rebuild its index of the entire web. Google's MapReduce was designed to operate on <key,value> pairs (a natural match for the string-centric computations of web search), and a second paper [PDF] described the Google File System (GFS), the distributed filesystem that the company developed to support its MapReduce clusters.
Google's MapReduce implementation was also provided to users of its App Engine service, but it is not free software. Doug Cutting, creator of Apache Lucene and co-creator of the Nutch search project, began an open source implementation of the MapReduce concept within Nutch in 2004; that work was spun out as the Hadoop project in 2006, around the time Cutting joined Yahoo. Yahoo has remained one of Hadoop's chief contributors and evangelists, and has been joined by other data-centric web companies such as Facebook and eBay. Hadoop is designed to run on commodity hardware, including heterogeneous nodes, and to scale up rapidly.
Extras
In addition to its central MapReduce framework, Hadoop ships with a number of infrastructure tools to support large MapReduce applications. The most notable is HDFS, a distributed filesystem designed to run on Hadoop clusters. In spite of the name, HDFS is not a filesystem in the Linux kernel sense of the word; it is a node-based storage system whose design mirrors that of a Hadoop cluster. A master node called the NameNode keeps track of all file metadata, and coordinates among the various DataNodes used for storage. Files are broken up into blocks and replicated among the DataNodes, based on parameters that are adjustable for optimum fault-tolerance or speed.
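Applications normally reach HDFS through the FileSystem API rather than talking to DataNodes directly. The fragment below is a minimal example of writing a file from Java; the NameNode address and the path are placeholders, and in a real deployment they would come from the cluster configuration rather than being set in code.

    // HdfsWriteExample.java -- create a small file on an HDFS volume
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // fs.default.name points at the NameNode; host and port are placeholders.
            conf.set("fs.default.name", "hdfs://namenode.example.com:8020");

            FileSystem fs = FileSystem.get(conf);

            // The client asks the NameNode for metadata, but the file's blocks
            // are streamed to (and replicated among) the DataNodes.
            FSDataOutputStream out = fs.create(new Path("/user/alice/hello.txt"));
            out.writeBytes("hello, HDFS\n");
            out.close();
            fs.close();
        }
    }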
The new release adds several noteworthy features to the filesystem, starting with WebHDFS, a REST-like HTTP interface. WebHDFS exposes complete filesystem access over HTTP via GET, PUT, POST, and DELETE requests, which makes it possible to manage an HDFS volume without writing custom Java or C client code. The filesystem can now also be protected against unauthorized access by requiring Kerberos authentication.
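A read through WebHDFS, for example, is an ordinary HTTP GET against the NameNode, which redirects the client to a DataNode that holds the file's blocks. The host, port, and path below are placeholders (and WebHDFS has to be enabled on the cluster), but the /webhdfs/v1/<path>?op= URL structure is part of the API.

    // WebHdfsRead.java -- fetch a file over the WebHDFS REST interface
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsRead {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://namenode.example.com:50070"
                    + "/webhdfs/v1/user/alice/hello.txt?op=OPEN");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");   // reads use GET; writes use PUT/POST, deletes use DELETE

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
        }
    }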
The MapReduce core also picks up a Kerberos authentication option, naturally. The Kerberos work is part of a larger security-hardening effort that was undertaken to prepare Hadoop for 1.0. The other security changes include stricter permissions on files and directories, enabling access control lists (ACLs) for task resources, and ensuring that an application's task processes run as non-privileged users.
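On the client side, Kerberos shows up in the security settings and the UserGroupInformation class. The sketch below illustrates a keytab-based login; the principal and keytab path are, of course, placeholders, and the cluster itself must be configured for Kerberos as well.

    // KerberosLoginExample.java -- authenticate a Hadoop client via a keytab
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLoginExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Switch from the default "simple" authentication to Kerberos.
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);

            // Log in with a principal and keytab (placeholders here).
            UserGroupInformation.loginUserFromKeytab(
                    "alice@EXAMPLE.COM", "/etc/security/keytabs/alice.keytab");

            System.out.println("Logged in as: "
                    + UserGroupInformation.getCurrentUser().getUserName());
        }
    }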
Hadoop itself is written in Java and provides a full Java API for MapReduce, but there are several interfaces designed to help developers code in other languages as well. The best known are Yahoo's Pig (a high-level scripting language), Facebook's Hive (which overlays a database-like structure and offers an SQL-like query interface), and Hadoop Streaming, which provides a text-based interface exposed through stdin and stdout. Using Hadoop Streaming (which, unlike Pig and Hive, is developed within the Hadoop project itself), developers can call executables written in any language as their mapper and reducer functions, routing data through them as they would with Unix pipes.
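Streaming mappers and reducers are usually written in scripting languages, but any executable that speaks the line-oriented protocol will do, even a plain Java program. The hypothetical streaming mapper below reads raw input lines on stdin and writes tab-separated <word, count> records on stdout; the framework sorts those records by key before feeding them to the reducer executable.

    // StreamingWordCountMapper.java -- a word-count mapper for Hadoop Streaming
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.StringTokenizer;

    public class StreamingWordCountMapper {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                StringTokenizer tokens = new StringTokenizer(line);
                while (tokens.hasMoreTokens()) {
                    // Streaming records are "key<TAB>value" lines on stdout.
                    System.out.println(tokens.nextToken() + "\t1");
                }
            }
        }
    }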
HDFS is not designed to serve as a relational database, optimized as it is for streaming read/write performance. But the popularity of projects like Hive shows that many Hadoop users are interested in some level of database-like functionality for their MapReduce problems. Google created the proprietary BigTable to add a database layer to GFS; Hadoop created HBase to offer similar functionality for HDFS. HBase does not offer many of the features that other "NoSQL" RDBMS-replacement products advertise (such as typed columns, secondary indexes, and advanced queries). It does, however, offer a table structure and record lookups, and implements "strongly consistent" reads and writes, sharding, and some optimization techniques such as block caches and Bloom filters. Under the hood, though, HBase stores its data in HDFS files.
HBase is officially part of the 1.0 release, and is now a fully-supported storage option for MapReduce jobs. Like HDFS, it is accessible using either Java or REST APIs. HBase and BigTable have never matched the speed of a traditional RDBMS, but the Hadoop project does say that the 1.0 release includes performance enhancements, particularly for access to HDFS files stored on the local disk.
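A minimal session with the HBase Java client might look like the following; the table name, column family, and cell contents are hypothetical, and the table is assumed to exist already.

    // HBaseExample.java -- write and read back a single cell
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "pages");   // placeholder table name

            // Write one cell: row "example.com", column family "content", qualifier "title".
            Put put = new Put(Bytes.toBytes("example.com"));
            put.add(Bytes.toBytes("content"), Bytes.toBytes("title"),
                    Bytes.toBytes("Example Domain"));
            table.put(put);

            // Read the cell back; HBase reads are strongly consistent.
            Get get = new Get(Bytes.toBytes("example.com"));
            Result result = table.get(get);
            byte[] title = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("title"));
            System.out.println(Bytes.toString(title));

            table.close();
        }
    }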
The Big Data user community has already widely embraced Hadoop, with heavyweight service providers like IBM and Oracle offering Hadoop-based products, in addition to the smaller, cloud-oriented service companies and various start-ups run by Hadoop project members. In many ways, the 1.0 milestone (which according to the release notes is based on the project's 0.20-security branch) is recognition of the stability the project has already achieved.
Consequently, Hadoop add-ons may be the focus of the news-making future developments for the project. Google has reportedly moved further away from the original MapReduce itself in recent years, putting more work into the GFS and BigTable layers of the stack. The various Hadoop service providers and Big Data users (Yahoo and Facebook included) are similarly extending the Hadoop core, with projects like Pig and Hive. The list of Hadoop-derived projects includes a number of efforts to leverage Hadoop's demonstrably-stable core to take on wildly different classes of problem: database applications, data collection, machine learning, and even configuration management. As beneficial as an open source MapReduce implementation is on its own merits, this ripple effect will influence a far wider set of computing tasks in the future.
Posted Jan 12, 2012 9:44 UTC (Thu) by tpo (subscriber, #25713)
Your articles seem to be getting better and better, Nathan. From Jon's well-informed, friendly, sarcastic kernel-only magazine, LWN has succeeded in evolving into a publication resting on multiple excellent regular contributors.
My very hearty congratulations. Please keep those good articles coming! Thank you!
Posted Jan 26, 2012 0:52 UTC (Thu) by roelofs (guest, #2599)
Minor clarification: Hadoop 1.0 can accurately be considered "just" an implementation of the map-reduce paradigm, but 2.0 splits the core framework from the MR application. MAPREDUCE-279 was the tracker for that work, which (AFAIK) is now undergoing stabilization on the 0.23 branch (possibly since renamed).
Post-split, it will be possible to implement MPI and other application layers on top of the core scheduling and resource-management layer.
Greg