map reduce
DESCRIPTION
Map Reduce presentation. Operating Systems. University of Georgia, 2010. TRANSCRIPT
Map Reduce
By Manuel Correa
Background
Large sets of data need to be processed in a fast and efficient way
In order to process a large set of data in a reasonable amount of time, the work needs to be distributed across thousands of machines
Programmers need to focus on solving problems without worrying about the implementation of the distribution
Map Reduce is the answer.
What is Map reduce?
Programming model for processing large data sets
Hides the implementation of parallelization, fault tolerance, data distribution and load balancing in a library
Inspired by characteristics of functional programming
Functional operations do not modify data structures. They always create new ones
Original data is not modified
Data flow is implicit within the application
The order of the operations does not matter
What is Map reduce?
There are two functions: Map and Reduce
Map
Input: Key/Value pairs
Output: Intermediate key/value pairs
Reduce
Input: a key and an iterator over that key's values
Output: a list of results
map(k1, v1) --> list(k2, v2)
reduce(k2, list(v2)) --> list(v2)
Complicated?
Map Reduce by example: Counting each word in a large set of documents
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
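The pseudocode above can be translated almost line for line into Python. This is a sketch of the programming model only, not the real MapReduce library API; `map_fn` and `reduce_fn` are illustrative names (avoiding Python's built-in `map`):

```python
def map_fn(key, value):
    # key: document name (unused here)
    # value: document contents
    for w in value.split():
        yield (w, "1")              # EmitIntermediate(w, "1")

def reduce_fn(key, values):
    # key: a word
    # values: an iterator of counts (as strings)
    result = 0
    for v in values:
        result += int(v)            # ParseInt(v)
    return str(result)              # Emit(AsString(result))
```

The library, not the programmer, is responsible for grouping all intermediate values of one key before calling `reduce_fn`.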
Map Reduce by example: Counting each word in a large set of documents
Document_1
foo
bar
baz
foo
bar
test
Document_2
test
foo
baz
bar
foo
Expected results:
<foo, 4>, <bar, 3>, <baz, 2>, <test, 2>
Map Reduce by example: Counting each word in a large set of documents
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

Map(document_1, contents(document_1))
<foo, "1">
<bar, "1">
<baz, "1">
<foo, "1">
<bar, "1">
<test, "1">

Map(document_2, contents(document_2))
<test, "1">
<foo, "1">
<baz, "1">
<bar, "1">
<foo, "1">
Map Reduce by example: Counting each word in a large set of documents
reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Reduce(word, values)
<foo, "2">
<bar, "2">
<baz, "1">
<test, "1">

Reduce(word, values)
<test, "1">
<foo, "2">
<baz, "1">
<bar, "1">
Map Reduce by example: Counting each word in a large set of documents
Input (intermediate pairs from the two workers):
<foo, "2">, <bar, "2">, <baz, "1">, <test, "1">
<test, "1">, <foo, "2">, <baz, "1">, <bar, "1">

Reduce(word, values)
<foo, "4">
<bar, "3">
<baz, "2">
<test, "2">
Expected results:
<foo, 4>, <bar, 3>, <baz, 2>, <test, 2>
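The whole worked example can be run end to end with a small sequential driver that simulates the map, shuffle (group-by-key), and reduce phases in a single process. This is an illustrative sketch of the model, not the real distributed library:

```python
from collections import defaultdict

def map_fn(key, value):
    # Emit (word, "1") for every word in the document contents.
    for w in value.split():
        yield (w, "1")

def reduce_fn(key, values):
    # Sum all the counts emitted for one word.
    return str(sum(int(v) for v in values))

def run_mapreduce(documents):
    # Map phase: run map_fn on every (name, contents) pair.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)   # shuffle: group values by key
    # Reduce phase: one reduce_fn call per distinct key.
    return {k: int(reduce_fn(k, iter(vs)))
            for k, vs in intermediate.items()}

docs = {
    "document_1": "foo bar baz foo bar test",
    "document_2": "test foo baz bar foo",
}
print(run_mapreduce(docs))
# prints {'foo': 4, 'bar': 3, 'baz': 2, 'test': 2}
```

The `defaultdict` stands in for the shuffle step that the real system performs across the network between map and reduce workers.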
Implementation
Master node
The master keeps different data structures for map and reduce tasks, where the status of each task is maintained
Status: idle, in-progress or completed
The master node keeps track of the intermediate files that feed the reduce tasks
The master node controls the interaction between the M map tasks and R reduce tasks
Fault Tolerance
The master pings every worker periodically
If a worker fails, the master marks that worker as failed and reassigns its task to another worker
Every worker must notify the master when it has finished its task. The master then assigns it another task
Each task is independent and can be restarted at any moment. Map Reduce is resilient to worker failures
What if the master fails? The master periodically checkpoints its status and data structures, so another master can restart from the last checkpoint
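The ping-and-reassign behavior described above can be sketched in a few lines of Python. The `Master` class and its method names are illustrative, not the actual implementation:

```python
class Master:
    def __init__(self, tasks, workers):
        self.idle = list(tasks)      # tasks waiting to be assigned
        self.in_progress = {}        # worker -> task currently running
        self.workers = set(workers)  # workers believed to be alive

    def assign(self, worker):
        # Hand an idle task to a worker, if any remain.
        if self.idle:
            self.in_progress[worker] = self.idle.pop()

    def ping(self, worker, alive):
        # On a failed ping, put the worker's task back in the idle
        # pool so it can be reassigned, and drop the worker.
        if not alive and worker in self.in_progress:
            self.idle.append(self.in_progress.pop(worker))
            self.workers.discard(worker)
```

Because each task is independent, returning it to the idle pool and re-running it on another worker is all the recovery the map phase needs.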
Task Granularity
There are M map tasks and R reduce tasks
M and R should be much larger than the number of workers
This enables dynamic load balancing across workers to optimize resources
The master must make O(M+R) scheduling decisions and keeps O(M*R) state in memory: roughly one byte per map/reduce task pair
According to the paper, Google uses M=200,000 and R=5,000 with 2,000 workers
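Plugging the paper's numbers into the formulas above gives a feel for the master's bookkeeping cost, assuming (as the paper estimates) about one byte of state per map/reduce task pair:

```python
# The paper's figures: M = 200,000 map tasks, R = 5,000 reduce tasks.
M, R = 200_000, 5_000

scheduling_decisions = M + R   # O(M + R) scheduling decisions
state_bytes = M * R            # O(M * R) state, ~1 byte per task pair

print(scheduling_decisions)    # 205000
print(state_bytes / 2**30)     # roughly 0.93 GiB of master state
```

Even at Google's scale the per-pair state fits comfortably in one machine's memory, which is why a single master is workable.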
Refinements
Partition function: controls how intermediate keys are split across reduce tasks, for load balancing
Ordering function: keys within a partition are processed in sorted order, which makes it easy to generate sorted output files
Combiner function: a partial reduce run locally on each map worker, often the same as the reduce function (see the word count example)
Input and output readers: support for several input and output formats
Skipping bad records: control over bad input
Local execution for debugging
Status information through an external application
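The partition function in the list above can be sketched as a hash of the intermediate key taken modulo R, so that keys spread roughly evenly over the R reduce tasks. This is a sketch of the idea; a real implementation may use a different hash:

```python
def partition(key, R):
    # Map an intermediate key to one of the R reduce tasks.
    # Equal keys always land in the same partition, so all values
    # for a key reach the same reduce worker.
    return hash(key) % R
```

Users can override this with a custom function, e.g. partitioning URLs by hostname so all pages of one site end up in the same output file.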
What are the benefits of map reduce?
Easy to use: programmers don't need to worry about the details of distributed computing
A large set of problems can be expressed in the Map Reduce programming model
Flexible and scalable on large clusters of machines, with fault tolerance that is elegant and works
Programs that can be expressed with Map Reduce
Distributed Grep <word, match>
Count URL Access Frequency <URL, total_count>
Reverse Web-link graph <target, list(source)>
Term-Vector per Host <word, frequency>
Inverted index <word, document ID>
Distributed Sort <key, record>
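The first entry, distributed grep, can be sketched as a map that emits matching lines and a reduce that simply passes them through. The function names are illustrative, and the pattern is passed as an extra argument for simplicity:

```python
def grep_map(filename, contents, pattern):
    # Emit (pattern, line) for each line that matches the pattern.
    for line in contents.splitlines():
        if pattern in line:
            yield (pattern, line)

def grep_reduce(key, values):
    # Identity reduce: the matching lines are the final output.
    return list(values)
```

Most of the other examples follow the same shape: the map extracts the interesting pairs, and the reduce aggregates them (or, as here, does nothing).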
References
MapReduce: Simplified Data Processing on Large Clusters (http://labs.google.com/papers/mapreduce-osdi04.pdf)
http://code.google.com/edu/parallel/mapreduce-tutorial.html
www.mapreduce.org
http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=PlayList&p=01ABB666FB64D768&index=0&playnext=1
http://hadoop.apache.org/
Map Reduce
Questions?