MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Communications of the ACM (CACM), January 2008, Vol. 51, No. 1
Overview Basic Functionality Refinements Performance Conclusion
Earlier Paper
Dean, J. and Ghemawat, S. 2004. MapReduce: Simplified data processing on large clusters.
In Proceedings of Operating Systems Design and Implementation (OSDI). San Francisco, CA, 137-150.
Motivation:
Process large amounts of data to produce other data
Use hundreds or thousands of CPUs
Make this easy
MapReduce provides:
Automatic parallelization and distribution
Fault tolerance
I/O scheduling
Status and monitoring
Programming Model
Two programmer specified functions:
Map
Input: key/value pairs → (k1, v1)
Output: intermediate key/value pairs → list(k2, v2)
Reduce
Input: intermediate key/value pairs → (k2, list(v2))
Output: list of values → list(v2)
Counting Example
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
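The pipeline the pseudocode above describes can be sketched in plain Python. This is a minimal in-memory simulation, not the paper's distributed implementation: `map_reduce` here plays the role of the library (including the shuffle that groups intermediate values by key).

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate <word, "1"> pair for each word.
    for word in contents.split():
        yield (word, "1")

def reduce_fn(word, values):
    # Sum the counts emitted for this word.
    return str(sum(int(v) for v in values))

def map_reduce(inputs, map_fn, reduce_fn):
    # Shuffle: group intermediate values by key (the library's job).
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    # Apply reduce to each key's list of values.
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}

docs = [("d1", "the quick fox"), ("d2", "the lazy dog")]
print(map_reduce(docs, map_fn, reduce_fn))
# {'dog': '1', 'fox': '1', 'lazy': '1', 'quick': '1', 'the': '2'}
```

In the real system the groups would be spread across R reduce workers rather than held in one dictionary.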
Other Examples
Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
Count of URL Access Frequency: The map function processes logs of web page requests and outputs 〈URL, 1〉. The reduce function adds together all values for the same URL and emits a 〈URL, total count〉 pair.
Other Examples
Reverse Web-Link Graph: The map function outputs 〈target, source〉 pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair 〈target, list(source)〉.
Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of 〈word, frequency〉 pairs. The map function emits a 〈hostname, term vector〉 pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, then emits a final 〈hostname, term vector〉 pair.
Other Examples
Inverted Index: The map function parses each document and emits a sequence of 〈word, document ID〉 pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a 〈word, list(document ID)〉 pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
Distributed Sort: The map function extracts the key from each record and emits a 〈key, record〉 pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning and ordering guarantees of the MapReduce library.
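The inverted-index example above can be sketched compactly. This is an in-memory illustration of the map/shuffle/reduce stages, with the shuffle's grouping-by-word collapsed into a dictionary:

```python
from collections import defaultdict

def inverted_index(docs):
    # docs: list of (document ID, text) pairs.
    # Map emits <word, doc ID>; the shuffle groups pairs by word;
    # reduce sorts the document IDs for each word.
    postings = defaultdict(set)
    for doc_id, text in docs:
        for word in text.split():
            postings[word].add(doc_id)
    return {w: sorted(ids) for w, ids in postings.items()}

docs = [(1, "to be or not to be"), (2, "to do or not to do")]
idx = inverted_index(docs)
print(idx["to"], idx["be"])  # [1, 2] [1]
```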
1. fork
The library splits the input files into M pieces and starts copies of the program on the cluster
2. assign map/reduce
Assign map and reduce tasks to idle workers
3. read
Map workers read the contents of the corresponding input split
4. local write
The output of map is buffered and periodically written to local disk
5. remote read
Reduce workers remotely read the intermediate data from the map workers' local disks
6. write
Each reduce worker processes its partition and writes the result to a final output file
Master Data Structures
For each task:
Stores its state: idle, in-progress, or completed
Tracks the locations and sizes of the intermediate data produced by completed map tasks
Uses this information for fault tolerance
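The bookkeeping described above can be sketched as follows. This is a hypothetical, simplified sketch of the master's per-task state (the names `Master`, `map_done`, `worker_failed`, and the location string are illustrative, not the paper's API):

```python
class Master:
    """Simplified sketch of the master's per-task bookkeeping."""
    IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

    def __init__(self, n_map, n_reduce):
        self.map_state = {i: Master.IDLE for i in range(n_map)}
        self.reduce_state = {i: Master.IDLE for i in range(n_reduce)}
        # For each completed map task: locations of its intermediate
        # file regions, to be handed to the reduce workers.
        self.intermediate = {}

    def assign_map(self, task):
        self.map_state[task] = Master.IN_PROGRESS

    def map_done(self, task, region_locations):
        self.map_state[task] = Master.COMPLETED
        self.intermediate[task] = region_locations

    def worker_failed(self, map_tasks_on_worker):
        # Both in-progress and completed map tasks on a failed machine
        # go back to idle: their local output is no longer reachable.
        for t in map_tasks_on_worker:
            self.map_state[t] = Master.IDLE
            self.intermediate.pop(t, None)

m = Master(n_map=3, n_reduce=1)
m.assign_map(0)
m.map_done(0, ["workerA:/disk/part-0"])  # hypothetical location string
m.worker_failed([0])
print(m.map_state[0])  # idle again; task 0 will be rescheduled
```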
Worker Failures
Master pings workers
If worker doesn’t answer, worker is marked as failed
All tasks (both in-progress and completed) from that worker are reset to idle
Nuances:
Completed map tasks must be re-executed because their output lives on the failed machine's local disk and is thus inaccessible
Completed reduce tasks need not be re-executed because their output is stored in the global file system
Locality
Network transfers are expensive
The input data (managed by GFS) is replicated on the local disks of the cluster machines
When possible, map tasks are assigned to workers that already hold the necessary data
Task Granularity
The map phase is divided into M tasks (input splits of typically 16-64 MB)
The reduce phase is divided into R tasks
M and R are usually much larger than the number of workers
This improves dynamic load balancing
Task Granularity
Upper bounds on granularity:
O(M+R) scheduling decisions
O(M×R) state kept in memory by the master
R is also constrained by output requirements
In practice:
An M such that each task processes 16-64 MB of input makes the locality optimization most effective
An R that is a small multiple of the number of workers
Example:
M = 200,000
R = 5,000
Workers = 2,000
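Plugging the example numbers into the bounds above shows why M×R matters: the scheduling work stays modest, but the per-pair state is a billion entries (the paper notes roughly one byte per map/reduce task pair, so on the order of a gigabyte of master memory).

```python
M, R, workers = 200_000, 5_000, 2_000

scheduling_decisions = M + R   # O(M + R)
state_entries = M * R          # O(M * R) pieces of in-memory state

print(scheduling_decisions)    # 205000
print(state_entries)           # 1000000000 (~1 GB at ~1 byte/entry)
print(M // workers)            # 100 map tasks per worker, on average
```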
Backup Tasks
Stragglers (machines that take unusually long to finish their last few tasks) are an issue
When the job is close to completion (there are more idle workers than remaining tasks), the master schedules backup executions of the in-progress tasks; whichever copy finishes first wins
Partitioning
R is the number of reduce tasks/output files.
Intermediate data is partitioned into R partitions by applying a function to the intermediate keys
e.g. “hash(key) mod R”
Results in well balanced partitions
Users can specify a custom partitioning function
“hash(Hostname(urlkey)) mod R”
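Both partitioners above are one-liners. The sketch below is illustrative (the `hostname` helper is a toy URL parser introduced here, not the library's): the default spreads keys evenly, while the custom version sends all URLs from one host to the same reduce partition, and hence the same output file.

```python
R = 4  # number of reduce tasks / output files

def default_partition(key):
    # Default: hash(key) mod R spreads keys evenly across partitions.
    return hash(key) % R

def hostname(url):
    # Toy URL parser for illustration only.
    return url.split("/")[2]

def host_partition(url_key):
    # Custom partitioner: hash(Hostname(urlkey)) mod R, so all URLs
    # from the same host end up in the same output file.
    return hash(hostname(url_key)) % R

a = host_partition("http://example.com/page-a")
b = host_partition("http://example.com/page-b")
print(a == b)  # True: same host, same reduce partition
```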
Ordering Guarantees
Within a partition, reducers process intermediate key/value pairs in increasing key order
Makes it easy to generate sorted output files per partition
Combiner Function
There is often significant repetition of intermediate keys
Example:
〈the, 1〉 occurs many times (Zipf distribution)
A combiner function can be specified
Each map worker partially merges its output with the combiner before sending data over the network
Partial combining significantly speeds up certain classes of operations (the 2004 paper has an example in Appendix A)
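The effect of a combiner on the word-count example can be sketched as follows. Without it, a mapper ships one ("the", 1) pair per occurrence over the network; with it, the counts are pre-summed per map task, so only one pair per distinct word leaves the worker (assuming, as in word count, that the reduce operation is commutative and associative):

```python
from collections import Counter

def map_with_combiner(doc):
    # Raw map output: one ("word", 1) pair per occurrence.
    raw = [(w, 1) for w in doc.split()]
    # Combiner: pre-sum counts locally before the network transfer.
    combined = Counter()
    for word, count in raw:
        combined[word] += count
    return list(combined.items())

pairs = map_with_combiner("the cat and the hat and the bat")
print(sorted(pairs))
# [('and', 2), ('bat', 1), ('cat', 1), ('hat', 1), ('the', 3)]
```

Eight raw pairs shrink to five combined pairs here; on real Zipf-distributed text the savings are far larger.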
Other Refinements
Input and Output Types: custom file formats are supported through a reader interface
Side Effects: users may produce auxiliary output files from their map/reduce operators
Skipping Bad Records: bugs (sometimes outside programmer control, such as in third-party libraries) can crash deterministically on specific records. MapReduce can detect this and skip those records to avoid repeated crashes
Local Execution: an alternative sequential implementation aids debugging
Status Information: the master runs an HTTP server that reports worker and job status
Counters: MapReduce can count a variety of events and report statistics on them
Grep
10^10 100-byte records, scanned for a rare three-character pattern
64 MB map pieces (M = 15,000)
Output in one file (R = 1)
Sort
10^10 100-byte records
Modeled after TeraSort benchmark
When is it not appropriate?
No Real-time Processing (MR is best suited to batch jobs)
Not every computation is easy to express as a MapReduce
No communication between tasks
Shuffling data is still expensive
Conclusion
MapReduce is fairly ubiquitous in Google operations
Used for indexing and many other applications
Lets programmers largely avoid thinking about parallelization and distribution
Handles failures elegantly
Questions?