Take a Close Look at MapReduce
Xuanhua Shi
Acknowledgement
Most of the slides are from Dr. Bing Chen, http://grid.hust.edu.cn/chengbin/
Some slides are from SHADI IBRAHIM, http://grid.hust.edu.cn/shadi/
What is MapReduce
Origin from Google [OSDI'04]
  A simple programming model
  Functional model
  For large-scale data processing
Exploits a large set of commodity computers
  Executes processes in a distributed manner
  Offers high availability
Motivation
Lots of demand for very large-scale data processing
Some common themes across these demands:
  Lots of machines needed (scaling)
  Two basic operations on the input: Map and Reduce
Distributed Grep
[Diagram: very big data is split into pieces; grep runs over each split in parallel, producing matches; cat concatenates all matches.]
Distributed Word Count
[Diagram: very big data is split into pieces; count runs over each split in parallel, producing partial counts; the partial counts are merged into the final merged count.]
Map+Reduce
Map:
  Accepts an input key/value pair
  Emits an intermediate key/value pair
Reduce:
  Accepts an intermediate key/value* pair
  Emits an output key/value pair
[Diagram: very big data flows through MAP; a partitioning function routes the intermediate pairs to REDUCE, which produces the result.]
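To make the types above concrete, here is a minimal sketch in Java of the contract the model imposes on user code. The interface and method names are illustrative only, not from any particular framework:

import java.util.Iterator;

// Illustrative sketch of the MapReduce contract; these interface
// names are hypothetical, not from any specific library.
interface Mapper<K1, V1, K2, V2> {
    // Consumes one input key/value pair and emits zero or more
    // intermediate key/value pairs via the emitter.
    void map(K1 key, V1 value, Emitter<K2, V2> emit);
}

interface Reducer<K2, V2, K3, V3> {
    // Consumes an intermediate key together with ALL values grouped
    // under it, and emits the output key/value pair(s).
    void reduce(K2 key, Iterator<V2> values, Emitter<K3, V3> emit);
}

interface Emitter<K, V> {
    void emit(K key, V value);
}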
The design and how it works
Architecture overview
[Diagram: the user submits a job to the Job tracker on the master node; Task trackers on slave nodes 1..N run the workers.]
GFS: underlying storage system
Goal
  Global view
  Make huge files available in the face of node failures
Master node (meta server)
  Centralized; indexes all chunks on the data servers
Chunk server (data server)
  A file is split into contiguous chunks, typically 16-64 MB
  Each chunk is replicated (usually 2x or 3x)
  Try to keep replicas in different racks
GFS architecture
[Diagram: a client contacts the GFS Master for metadata; chunks C0, C1, C2, C3, C5 are replicated across Chunkserver 1, Chunkserver 2, ..., Chunkserver N.]
Functions in the Model
Map
  Processes a key/value pair to generate intermediate key/value pairs
Reduce
  Merges all intermediate values associated with the same key
Partition
  By default: hash(key) mod R
  Well balanced
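As a concrete illustration of the default partitioning rule, a minimal Java sketch (the sign-bit masking is a Java-specific detail, since hashCode() may be negative; it is not from the original paper):

public class DefaultPartitioner {
    // hash(key) mod R, as on the slide: maps each intermediate key
    // to one of the R reduce tasks.
    public static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}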
Diagram (1)
Diagram (2)
A Simple Example
Counting words in a large set of documents
map(String key, String value)
  // key: document name
  // value: document contents
  for each word w in value
    EmitIntermediate(w, "1");

reduce(String key, Iterator values)
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values
    result += ParseInt(v);
  Emit(AsString(result));
How does it work?
Locality issue
Master scheduling policy
  Asks GFS for the locations of replicas of the input file blocks
  Map tasks are typically split into 64 MB pieces (== GFS block size)
  Map tasks are scheduled so that a replica of their GFS input block is on the same machine or the same rack
Effect
  Thousands of machines read input at local disk speed
  Without this, rack switches would limit the read rate
Fault Tolerance
Reactive way
  Worker failure
    Heartbeat: workers are periodically pinged by the master
    No response = failed worker
    If a worker fails, its tasks are reassigned to another worker
  Master failure
    The master writes periodic checkpoints
    Another master can be started from the last checkpointed state
    If the master eventually dies, the job is aborted
Fault Tolerance
Proactive way (Redundant Execution)
  The problem of "stragglers" (slow workers)
    Other jobs consuming resources on the machine
    Bad disks with soft errors transfer data very slowly
    Weird things: processor caches disabled (!!)
  When the computation is almost done, reschedule in-progress tasks
  Whenever either the primary or the backup execution finishes, mark the task as completed
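A minimal sketch of the backup-task idea in Java, purely for illustration: the Task interface and its methods (isDone, isInProgress, scheduleBackup) are hypothetical names, not from any real MapReduce implementation.

import java.util.List;

// Once the job is nearly finished, schedule duplicate executions of
// the remaining in-progress tasks; whichever copy finishes first
// marks the task completed.
class BackupScheduler {
    static final double ALMOST_DONE = 0.95; // fraction of tasks finished

    void maybeScheduleBackups(List<Task> tasks) {
        long done = tasks.stream().filter(Task::isDone).count();
        if ((double) done / tasks.size() < ALMOST_DONE) return;
        for (Task t : tasks) {
            if (t.isInProgress() && !t.hasBackup()) {
                t.scheduleBackup(); // duplicate execution on another worker
            }
        }
    }
}

interface Task {
    boolean isDone();
    boolean isInProgress();
    boolean hasBackup();
    void scheduleBackup();
}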
Fault Tolerance
Input error: bad records
  Map/Reduce functions sometimes fail for particular inputs
  The best solution is to debug & fix, but that is not always possible
  On a segmentation fault:
    Send a UDP packet to the master from the signal handler
    Include the sequence number of the record being processed
  Skip bad records:
    If the master sees two failures for the same record, the next worker is told to skip that record
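An illustrative Java sketch of the master-side skip policy just described; all class and method names here are hypothetical:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// The master counts crash reports per record (arriving via the UDP
// packet sent from the worker's signal handler) and tells the next
// worker to skip any record that has already failed twice.
class BadRecordTracker {
    private final Map<Long, Integer> failures = new HashMap<>();
    private final Set<Long> skipList = new HashSet<>();

    // Called when a crash report naming this record arrives.
    void recordFailure(long recordSeqNo) {
        int n = failures.merge(recordSeqNo, 1, Integer::sum);
        if (n >= 2) {
            skipList.add(recordSeqNo); // two failures: skip it next time
        }
    }

    // Consulted by the next worker before processing a record.
    boolean shouldSkip(long recordSeqNo) {
        return skipList.contains(recordSeqNo);
    }
}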
Status monitor
Refinements
Task granularity
  Minimizes time for fault recovery
  Load balancing
Local execution for debugging/testing
Compression of intermediate data
Points to be emphasized
No reduce can begin until map is complete
The master must communicate the locations of intermediate files
Tasks are scheduled based on the location of data
If a map worker fails at any time before reduce finishes, its tasks must be completely rerun
The MapReduce library does most of the hard work for us!
Model is Widely Applicable
MapReduce programs in the Google source tree:
distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation, ...
Examples follow.
How to use it
User to-do list:
  Indicate:
    Input/output files
    M: number of map tasks
    R: number of reduce tasks
    W: number of machines
  Write the map and reduce functions
  Submit the job
Detailed Example: Word Count (1)
Map
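The slide's code image is not in the transcript; as a stand-in, here is a sketch of the map side in the style of the classic Hadoop WordCount example (old org.apache.hadoop.mapred API, matching the Hadoop of this talk's era). It is a reconstruction, not a copy of the original slide.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // Tokenize the input line and emit (word, 1) for every word.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, ONE);
        }
    }
}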
Detailed Example: Word Count (2)
Reduce
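A matching reconstruction of the reduce side (same caveat as above): sum the counts grouped under each word and emit the total.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        // All counts for this word arrive together; add them up.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}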
Detailed Example: Word Count (3)
Main
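A reconstruction of the driver: configure the job (I/O paths, key/value types, map and reduce classes) and submit it. WordCountMap and WordCountReduce refer to the sketches above.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMap.class);
        conf.setReducerClass(WordCountReduce.class);
        // The reducer also works as a combiner here, since summing
        // partial counts is associative.
        conf.setCombinerClass(WordCountReduce.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf); // blocks until the job completes
    }
}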
Applications
String match, such as grep
Reverse index
Count URL access frequency
Lots of examples in data mining
MapReduce Implementations
MapReduce
  Cluster: (1) Google, (2) Apache Hadoop
  Multicore CPU: Phoenix @ Stanford
  GPU: Mars @ HKUST
Hadoop
Open source
Java-based implementation of MapReduce
Uses HDFS as the underlying file system
Hadoop
Google      Yahoo
MapReduce   Hadoop
GFS         HDFS
Bigtable    HBase
Chubby      (nothing yet… but planned)
Recent news about Hadoop
Apache Hadoop wins the Terabyte Sort Benchmark
The sort used 1800 maps and 1800 reduces, and allocated enough buffer memory to hold the intermediate data in memory.
Phoenix
Best paper at HPCA'07
MapReduce for multiprocessor systems
Shared-memory implementation of MapReduce
  SMP, multi-core
Features
  Uses threads instead of cluster nodes for parallelism
  Communicates through shared memory instead of network messages
  Dynamic scheduling, locality management, fault recovery
Workflow
The Phoenix API
  System-defined functions
  User-defined functions
Mars: MapReduce on GPU
PACT'08
GeForce 8800 GTX, PS3, Xbox 360
Implementation of Mars
[Diagram: software stack — user applications on top of MapReduce, which sits on CUDA and system calls, over the operating system (Windows or Linux), running on an NVIDIA GPU (GeForce 8800 GTX) and a CPU (Intel P4, four cores, 2.4 GHz).]
Implementation of Mars
Discussion
We have MPI and PVM. Why do we need MapReduce?

                MPI, PVM                                MapReduce
Objective       General distributed programming model   Large-scale data processing
Availability    Weaker, harder                          Better
Data locality   MPI-IO                                  GFS
Usability       Difficult to learn                      Easier
Conclusions
Provides a general-purpose model to simplify large-scale computation
Allows users to focus on the problem without worrying about the details
References
Original paper: http://labs.google.com/papers/mapreduce.html
Wikipedia: http://en.wikipedia.org/wiki/MapReduce
Hadoop (MapReduce in Java): http://lucene.apache.org/hadoop/
MapReduce tutorial: http://code.google.com/edu/parallel/mapreduce-tutorial.html