MapReduce: Difference between revisions

Content deleted Content added

Inline

Revision as of 21:55, 21 June 2007

MapReduce is a software framework originally developed by Google to support parallel computations over large (greater than 100 terabyte) data sets on unreliable clusters of computers. The name is derived from the map and reduce functions commonly used in functional programming.^[1]

MapReduce implementations have been written in C++, Python, Ruby, Java and some other languages.

Dataflow

The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:

an input reader
a Map function
a partition function
a compare function
a Reduce function
an output writer

Input reader

The input reader divides the input into 16MB to 128MB splits and the framework assigns one split to each Map function. The input reader reads the data from stable storage (typically a distributed file system like Google File System) and generates key, value pairs.

A common example will read a directory full of text files and return each line as a record.

Map function

Each Map function gets series of key, value pairs; processes each; and generates 0 or more output key, value pairs. The input and output types of the map can be and often are different from each other.

If the application is doing a word count, the map function would break the line into words and output the word as the key and $1$ as the value.

Partition function

The output of all of the maps are allocated to particular reduces by the applications's partition function. The partition function is given the key and the number of reduces and returns the index of the desired reduce.

A typical default is to hash the key and modulo the number of reduces.

Comparison function

The input for each reduce is pulled from the machine where the map ran and sorted using the application's comparison function.

Reduce function

The framework calls the application's reduce function once for each unique key in the sorted order. The reduce can iterate through the values that are associated with that key and output 0 or more key, value pairs.

In the word count example, the reduce function takes the input values, sums them and generates a single output of the word and the final sum.

Output writer

The Output Writer writes the output of the reduce to stable storage, usually a distributed file system, such as Google File System.

Distribution and reliability

MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network; each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node (similar to the master server in the Google File System) records the node as dead, and sends out the node's assigned work to other nodes. Individual operations use atomic operations for naming file outputs as a double check to ensure that there are not parallel conflicting threads running; when files are renamed, it is possible to also copy them to another name in addition to the name of the task (allowing for side-effects).

The reduce operations operate much the same way, but because of their inferior properties with regard to parallel operations, the master node attempts to schedule reduce operations on the same node, or as close as possible to the node holding the data being operated on; this property is desirable for Google as it conserves bandwidth.

Uses

MapReduce is useful in a wide range of applications, including: "distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation..." Most significantly, when MapReduce was finished, it was used to completely regenerate Google's index of the Internet, and replaced the old ad hoc programs that updated the index and ran the various analyses. ^[2]

MapReduce's stable inputs and outputs are usually stored in a distributed file system, such as Google File System. The transient data is usually stored on local disk and fetched remotely by the reduces.

Implementations

The Google MapReduce framework is implemented in C++ with interfaces in Python and Java.
The Hadoop project [1] is a free open source Java MapReduce implementation.
A Ruby implementation is available in the Starfish gem [2].

References

Dean, Jeffrey & Ghemawat, Sanjay (2004). "MapReduce: Simplified Data Processing on Large Clusters". Retrieved Apr. 6, 2005.

^ "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." -"MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat; from Google Labs
^ "As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes." *"How Google Works"

External links

Papers

"MapReduce: Simplified Data Processing on Large Clusters" — paper by Jeffrey Dean and Sanjay Ghemawat; from Google Labs
"Interpreting the Data: Parallel Analysis with Sawzall" — paper by Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan; from Google Labs
"Google's MapReduce Programming Model -- Revisited" — paper by Ralf Lammel; from Microsoft
Hung-Chih Yang, Ali Dasdan, Ruey-Lung Hsiao, D. Stott Parker (2007). "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters". To appear in Proc. of SIGMOD'07. (This paper shows how to extend MapReduce for relational data processing.)
Heritrix Hadoop DFS Writer Processor - Heritrix extension to store crawled content directly into the Hadoop Distributed FileSystem (HDFS) for processing with Map-Reduce by Doug Judd; from Zvents

Articles

"How Google Works - Reducing Complexity" — article from Baseline magazine
"Can Your Programming Language Do This?" — article from the Joel on Software weblog
Nutch MapReduce — article about MapReduce in Nutch from Tom White's weblog
Cat MapReduce — article about MapReduce in Cat from the Cat project wiki.
"Simple Map Reduce in Ruby" - article about using SimpleMapReduce on Ruby's Rinda which uses DrbRuby

Software

Starfish — Open source Ruby implementation of MapReduce.
Hadoop — open source MapReduce implementation from Apache
IBM MapReduce Tools for Eclipse — a plug-in that supports the creation of MapReduce applications within Eclipse.
QtConcurrent Open Source C++ MapReduce (non-distributed) implementation from Trolltech

[map-1] "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." -"MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat; from Google Labs

[usage-2] "As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes." *"How Google Works"

[1]

[2]

@@ Line 69: / Line 69: @@
 * [http://weblogs.java.net/blog/tomwhite/archive/2005/09/mapreduce.html Nutch MapReduce] — article about MapReduce in [[Nutch]] from Tom White's weblog
 * [http://code.google.com/p/cat-language/wiki/MapReduce Cat MapReduce] — article about MapReduce in [[Cat (programming language)|Cat]] from the Cat project wiki.
-* [http://multipart-mixed.com/software/simple_mapreduce_in_ruby.html Simple Map Reduce in Ruby] - article about using SimpleMapReduce on Ruby's [[Rinda]]
+* [http://multipart-mixed.com/software/simple_mapreduce_in_ruby.html "Simple Map Reduce in Ruby"] - article about using SimpleMapReduce on Ruby's [[Rinda]] which uses DrbRuby
 ===Software===