map reduce
DESCRIPTION
Map Reduce presentation. Operating Systems. University of Georgia, 2010. TRANSCRIPT
Map Reduce
By Manuel Correa
Background
Large sets of data need to be processed in a fast and efficient way
In order to process a large set of data in a reasonable amount of time, the work needs to be distributed across thousands of machines
Programmers need to focus on solving problems without worrying about the implementation of the distribution
Map Reduce is the answer.
What is Map reduce?
Programming model for processing large data sets
Hides the implementation of parallelization, fault tolerance, data distribution and load balancing in a library
Inspired by characteristics of functional programming
Functional operations do not modify data structures. They always create new ones
Original data is not modified
Data flow is implicit within the application
The order of the operations does not matter
What is Map reduce?
There are two functions: Map and Reduce
Map
Input: Key/Value pairs
Output: Intermediate key/value pairs
Reduce
Input: a key and an iterator over that key's values
Output: a list of results
map(k1, v1) --> list(k2, v2)
reduce(k2, list(v2)) --> list(v2)
Complicated?
Map Reduce by example: Counting each word in a large set of documents
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
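The pseudocode above can be translated almost line for line into Python. This is a sketch of the programming model only, not the real MapReduce library API; `map_fn` and `reduce_fn` are illustrative names (avoiding Python's built-in `map`):

```python
def map_fn(key, value):
    # key: document name (unused here)
    # value: document contents
    for w in value.split():
        yield (w, "1")              # EmitIntermediate(w, "1")

def reduce_fn(key, values):
    # key: a word
    # values: an iterator of counts (as strings)
    result = 0
    for v in values:
        result += int(v)            # ParseInt(v)
    return str(result)              # Emit(AsString(result))
```

The library, not the programmer, is responsible for grouping all intermediate values of one key before calling `reduce_fn`.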
Map Reduce by example: Counting each word in a large set of documents
Document_1
foo
bar
baz
foo
bar
test
Document_2
test
foo
baz
bar
foo
Expected results:
<foo, 4>, <bar, 3>, <baz, 2>, <test, 2>
Map Reduce by example: Counting each word in a large set of documents
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

Map(document_1, contents(document_1))
<foo, "1">
<bar, "1">
<baz, "1">
<foo, "1">
<bar, "1">
<test, "1">

Map(document_2, contents(document_2))
<test, "1">
<foo, "1">
<baz, "1">
<bar, "1">
<foo, "1">
Map Reduce by example: Counting each word in a large set of documents
reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Reduce(word, values)
<foo, "2">
<bar, "2">
<baz, "1">
<test, "1">

Reduce(word, values)
<test, "1">
<foo, "2">
<baz, "1">
<bar, "1">
Map Reduce by example: Counting each word in a large set of documents
Input (intermediate pairs from the two workers):
<foo, "2">, <bar, "2">, <baz, "1">, <test, "1">
<test, "1">, <foo, "2">, <baz, "1">, <bar, "1">

Reduce(word, values)
<foo, "4">
<bar, "3">
<baz, "2">
<test, "2">
Expected results:
<foo, 4>, <bar, 3>, <baz, 2>, <test, 2>
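The whole worked example can be run end to end with a small sequential driver that simulates the map, shuffle (group-by-key), and reduce phases in a single process. This is an illustrative sketch of the model, not the real distributed library:

```python
from collections import defaultdict

def map_fn(key, value):
    # Emit (word, "1") for every word in the document contents.
    for w in value.split():
        yield (w, "1")

def reduce_fn(key, values):
    # Sum all the counts emitted for one word.
    return str(sum(int(v) for v in values))

def run_mapreduce(documents):
    # Map phase: run map_fn on every (name, contents) pair.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)   # shuffle: group values by key
    # Reduce phase: one reduce_fn call per distinct key.
    return {k: int(reduce_fn(k, iter(vs)))
            for k, vs in intermediate.items()}

docs = {
    "document_1": "foo bar baz foo bar test",
    "document_2": "test foo baz bar foo",
}
print(run_mapreduce(docs))
# prints {'foo': 4, 'bar': 3, 'baz': 2, 'test': 2}
```

The `defaultdict` stands in for the shuffle step that the real system performs across the network between map and reduce workers.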
Implementation
Master node
The master keeps different data structures for map and reduce tasks, where the status of each task is maintained
Status: idle, in-progress or completed
The master node keeps track of the intermediate files that feed the reduce tasks
The master node controls the interaction between the M map tasks and R reduce tasks
Fault Tolerance
The master pings every worker periodically
If a worker fails, the master marks that worker as failed and reassigns its task to another worker
Every worker must notify the master when it has finished its task. The master then assigns it another task
Each task is independent and can be restarted at any moment. Map Reduce is resilient to worker failures
What if the master fails? The master periodically checkpoints its status and data structures, so another master can restart from the last checkpoint
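The ping-and-reassign behavior described above can be sketched in a few lines of Python. The `Master` class and its method names are illustrative, not the actual implementation:

```python
class Master:
    def __init__(self, tasks, workers):
        self.idle = list(tasks)      # tasks waiting to be assigned
        self.in_progress = {}        # worker -> task currently running
        self.workers = set(workers)  # workers believed to be alive

    def assign(self, worker):
        # Hand an idle task to a worker, if any remain.
        if self.idle:
            self.in_progress[worker] = self.idle.pop()

    def ping(self, worker, alive):
        # On a failed ping, put the worker's task back in the idle
        # pool so it can be reassigned, and drop the worker.
        if not alive and worker in self.in_progress:
            self.idle.append(self.in_progress.pop(worker))
            self.workers.discard(worker)
```

Because each task is independent, returning it to the idle pool and re-running it on another worker is all the recovery the map phase needs.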
Task Granularity
There are M map tasks and R reduce tasks
M and R should be much larger than the number of workers
This enables dynamic load balancing across workers to optimize resources
The master must make O(M+R) scheduling decisions and keeps O(M*R) state in memory: roughly one byte per map/reduce task pair
According to the paper, Google uses M=200,000 and R=5,000 with 2,000 workers
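Plugging the paper's numbers into the formulas above gives a feel for the master's bookkeeping cost, assuming (as the paper estimates) about one byte of state per map/reduce task pair:

```python
# The paper's figures: M = 200,000 map tasks, R = 5,000 reduce tasks.
M, R = 200_000, 5_000

scheduling_decisions = M + R   # O(M + R) scheduling decisions
state_bytes = M * R            # O(M * R) state, ~1 byte per task pair

print(scheduling_decisions)    # 205000
print(state_bytes / 2**30)     # roughly 0.93 GiB of master state
```

Even at Google's scale the per-pair state fits comfortably in one machine's memory, which is why a single master is workable.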
Refinements
Partition function: controls how intermediate keys are split across reduce tasks, for load balancing
Ordering function: keys within a partition are processed in sorted order, which makes it easy to generate sorted output files
Combiner function: a partial reduce run locally on each map worker, often the same as the reduce function (see the word count example)
Input and output readers: support for several input and output formats
Skipping bad records: control over bad input
Local execution for debugging
Status information through an external application
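The partition function in the list above can be sketched as a hash of the intermediate key taken modulo R, so that keys spread roughly evenly over the R reduce tasks. This is a sketch of the idea; a real implementation may use a different hash:

```python
def partition(key, R):
    # Map an intermediate key to one of the R reduce tasks.
    # Equal keys always land in the same partition, so all values
    # for a key reach the same reduce worker.
    return hash(key) % R
```

Users can override this with a custom function, e.g. partitioning URLs by hostname so all pages of one site end up in the same output file.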
What are the benefits of map reduce?
Easy to use: programmers don't need to worry about the details of distributed computing
A large set of problems can be expressed in the Map Reduce programming model
Flexible and scalable on large clusters of machines, with fault tolerance that is elegant and works
Programs that can be expressed with Map Reduce
Distributed Grep <word, match>
Count URL Access Frequency <URL, total_count>
Reverse Web-link graph <target, list(source)>
Term-Vector per Host <word, frequency>
Inverted index <word, document ID>
Distributed Sort <key, record>
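The first entry, distributed grep, can be sketched as a map that emits matching lines and a reduce that simply passes them through. The function names are illustrative, and the pattern is passed as an extra argument for simplicity:

```python
def grep_map(filename, contents, pattern):
    # Emit (pattern, line) for each line that matches the pattern.
    for line in contents.splitlines():
        if pattern in line:
            yield (pattern, line)

def grep_reduce(key, values):
    # Identity reduce: the matching lines are the final output.
    return list(values)
```

Most of the other examples follow the same shape: the map extracts the interesting pairs, and the reduce aggregates them (or, as here, does nothing).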
References
MapReduce: Simplified Data Processing on Large Clusters (http://labs.google.com/papers/mapreduce-osdi04.pdf)
http://code.google.com/edu/parallel/mapreduce-tutorial.html
www.mapreduce.org
http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=PlayList&p=01ABB666FB64D768&index=0&playnext=1
http://hadoop.apache.org/
Map Reduce
Questions?