Take a Close Look at MapReduce
Xuanhua Shi
Acknowledgement
Most of the slides are from Dr. Bing Chen, http://grid.hust.edu.cn/chengbin/
Some slides are from SHADI IBRAHIM, http://grid.hust.edu.cn/shadi/
What is MapReduce
Origin from Google [OSDI'04]
  A simple programming model
  Functional model
  For large-scale data processing
Exploits a large set of commodity computers
  Executes processes in a distributed manner
  Offers high availability
Motivation
Lots of demand for very large-scale data processing
Some common themes across these demands:
  Lots of machines needed (scaling)
  Two basic operations on the input: Map and Reduce
Distributed Grep
[Diagram: very big data is split into pieces; grep runs over each split in parallel, producing matches; cat concatenates all matches.]
Distributed Word Count
[Diagram: very big data is split into pieces; count runs over each split in parallel, producing partial counts; the partial counts are merged into the final merged count.]
Map+Reduce
Map:
  Accepts an input key/value pair
  Emits an intermediate key/value pair
Reduce:
  Accepts an intermediate key/value* pair
  Emits an output key/value pair
[Diagram: very big data flows through MAP; a partitioning function routes the intermediate pairs to REDUCE, which produces the result.]
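To make the types above concrete, here is a minimal sketch in Java of the contract the model imposes on user code. The interface and method names are illustrative only, not from any particular framework:

import java.util.Iterator;

// Illustrative sketch of the MapReduce contract; these interface
// names are hypothetical, not from any specific library.
interface Mapper<K1, V1, K2, V2> {
    // Consumes one input key/value pair and emits zero or more
    // intermediate key/value pairs via the emitter.
    void map(K1 key, V1 value, Emitter<K2, V2> emit);
}

interface Reducer<K2, V2, K3, V3> {
    // Consumes an intermediate key together with ALL values grouped
    // under it, and emits the output key/value pair(s).
    void reduce(K2 key, Iterator<V2> values, Emitter<K3, V3> emit);
}

interface Emitter<K, V> {
    void emit(K key, V value);
}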
The design and how it works
Architecture overview
[Diagram: the user submits a job to the Job tracker on the master node; Task trackers on slave nodes 1..N run the workers.]
GFS: underlying storage system
Goal
  Global view
  Make huge files available in the face of node failures
Master node (meta server)
  Centralized; indexes all chunks on the data servers
Chunk server (data server)
  A file is split into contiguous chunks, typically 16-64 MB
  Each chunk is replicated (usually 2x or 3x)
  Try to keep replicas in different racks
GFS architecture
[Diagram: a client contacts the GFS Master for metadata; chunks C0, C1, C2, C3, C5 are replicated across Chunkserver 1, Chunkserver 2, ..., Chunkserver N.]
Functions in the Model
Map
  Processes a key/value pair to generate intermediate key/value pairs
Reduce
  Merges all intermediate values associated with the same key
Partition
  By default: hash(key) mod R
  Well balanced
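As a concrete illustration of the default partitioning rule, a minimal Java sketch (the sign-bit masking is a Java-specific detail, since hashCode() may be negative; it is not from the original paper):

public class DefaultPartitioner {
    // hash(key) mod R, as on the slide: maps each intermediate key
    // to one of the R reduce tasks.
    public static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}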
Diagram (1)
Diagram (2)
A Simple Example
Counting words in a large set of documents
map(String key, String value)
  // key: document name
  // value: document contents
  for each word w in value
    EmitIntermediate(w, "1");

reduce(String key, Iterator values)
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values
    result += ParseInt(v);
  Emit(AsString(result));
How does it work?
Locality issue
Master scheduling policy
  Asks GFS for the locations of replicas of the input file blocks
  Map tasks are typically split into 64 MB pieces (== GFS block size)
  Map tasks are scheduled so that a replica of their GFS input block is on the same machine or the same rack
Effect
  Thousands of machines read input at local disk speed
  Without this, rack switches would limit the read rate
Fault Tolerance
Reactive way
  Worker failure
    Heartbeat: workers are periodically pinged by the master
    No response = failed worker
    If a worker fails, its tasks are reassigned to another worker
  Master failure
    The master writes periodic checkpoints
    Another master can be started from the last checkpointed state
    If the master eventually dies, the job is aborted
Fault Tolerance
Proactive way (Redundant Execution)
  The problem of "stragglers" (slow workers)
    Other jobs consuming resources on the machine
    Bad disks with soft errors transfer data very slowly
    Weird things: processor caches disabled (!!)
  When the computation is almost done, reschedule in-progress tasks
  Whenever either the primary or the backup execution finishes, mark the task as completed
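A minimal sketch of the backup-task idea in Java, purely for illustration: the Task interface and its methods (isDone, isInProgress, scheduleBackup) are hypothetical names, not from any real MapReduce implementation.

import java.util.List;

// Once the job is nearly finished, schedule duplicate executions of
// the remaining in-progress tasks; whichever copy finishes first
// marks the task completed.
class BackupScheduler {
    static final double ALMOST_DONE = 0.95; // fraction of tasks finished

    void maybeScheduleBackups(List<Task> tasks) {
        long done = tasks.stream().filter(Task::isDone).count();
        if ((double) done / tasks.size() < ALMOST_DONE) return;
        for (Task t : tasks) {
            if (t.isInProgress() && !t.hasBackup()) {
                t.scheduleBackup(); // duplicate execution on another worker
            }
        }
    }
}

interface Task {
    boolean isDone();
    boolean isInProgress();
    boolean hasBackup();
    void scheduleBackup();
}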
Fault Tolerance
Input error: bad records
  Map/Reduce functions sometimes fail for particular inputs
  The best solution is to debug & fix, but that is not always possible
  On a segmentation fault:
    Send a UDP packet to the master from the signal handler
    Include the sequence number of the record being processed
  Skip bad records:
    If the master sees two failures for the same record, the next worker is told to skip that record
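An illustrative Java sketch of the master-side skip policy just described; all class and method names here are hypothetical:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// The master counts crash reports per record (arriving via the UDP
// packet sent from the worker's signal handler) and tells the next
// worker to skip any record that has already failed twice.
class BadRecordTracker {
    private final Map<Long, Integer> failures = new HashMap<>();
    private final Set<Long> skipList = new HashSet<>();

    // Called when a crash report naming this record arrives.
    void recordFailure(long recordSeqNo) {
        int n = failures.merge(recordSeqNo, 1, Integer::sum);
        if (n >= 2) {
            skipList.add(recordSeqNo); // two failures: skip it next time
        }
    }

    // Consulted by the next worker before processing a record.
    boolean shouldSkip(long recordSeqNo) {
        return skipList.contains(recordSeqNo);
    }
}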
Status monitor
Refinements
Task granularity
  Minimizes time for fault recovery
  Load balancing
Local execution for debugging/testing
Compression of intermediate data
Points to be emphasized
No reduce can begin until map is complete
The master must communicate the locations of intermediate files
Tasks are scheduled based on the location of data
If a map worker fails at any time before reduce finishes, its tasks must be completely rerun
The MapReduce library does most of the hard work for us!
Model is Widely Applicable
MapReduce programs in the Google source tree:
distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation, ...
Examples follow.
How to use it
User to-do list:
  Indicate:
    Input/output files
    M: number of map tasks
    R: number of reduce tasks
    W: number of machines
  Write the map and reduce functions
  Submit the job
Detailed Example: Word Count (1)
Map
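The slide's code image is not in the transcript; as a stand-in, here is a sketch of the map side in the style of the classic Hadoop WordCount example (old org.apache.hadoop.mapred API, matching the Hadoop of this talk's era). It is a reconstruction, not a copy of the original slide.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // Tokenize the input line and emit (word, 1) for every word.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, ONE);
        }
    }
}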
Detailed Example: Word Count (2)
Reduce
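A matching reconstruction of the reduce side (same caveat as above): sum the counts grouped under each word and emit the total.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        // All counts for this word arrive together; add them up.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}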
Detailed Example: Word Count (3)
Main
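A reconstruction of the driver: configure the job (I/O paths, key/value types, map and reduce classes) and submit it. WordCountMap and WordCountReduce refer to the sketches above.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMap.class);
        conf.setReducerClass(WordCountReduce.class);
        // The reducer also works as a combiner here, since summing
        // partial counts is associative.
        conf.setCombinerClass(WordCountReduce.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf); // blocks until the job completes
    }
}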
Applications
String match, such as grep
Reverse index
Count URL access frequency
Lots of examples in data mining
MapReduce Implementations
MapReduce
  Cluster: (1) Google, (2) Apache Hadoop
  Multicore CPU: Phoenix @ Stanford
  GPU: Mars @ HKUST
Hadoop
Open source
Java-based implementation of MapReduce
Uses HDFS as the underlying file system
Hadoop
Google      Yahoo
MapReduce   Hadoop
GFS         HDFS
Bigtable    HBase
Chubby      (nothing yet… but planned)
Recent news about Hadoop
Apache Hadoop wins the Terabyte Sort Benchmark
The sort used 1800 maps and 1800 reduces, and allocated enough buffer memory to hold the intermediate data in memory.
Phoenix
Best paper at HPCA'07
MapReduce for multiprocessor systems
Shared-memory implementation of MapReduce
  SMP, multi-core
Features
  Uses threads instead of cluster nodes for parallelism
  Communicates through shared memory instead of network messages
  Dynamic scheduling, locality management, fault recovery
Workflow
The Phoenix API
  System-defined functions
  User-defined functions
Mars: MapReduce on GPU
PACT'08
GeForce 8800 GTX, PS3, Xbox 360
Implementation of Mars
[Diagram: software stack — user applications on top of MapReduce, which sits on CUDA and system calls, over the operating system (Windows or Linux), running on an NVIDIA GPU (GeForce 8800 GTX) and a CPU (Intel P4, four cores, 2.4 GHz).]
Implementation of Mars
Discussion
We have MPI and PVM. Why do we need MapReduce?

                MPI, PVM                                MapReduce
Objective       General distributed programming model   Large-scale data processing
Availability    Weaker, harder                          Better
Data locality   MPI-IO                                  GFS
Usability       Difficult to learn                      Easier
Conclusions
Provides a general-purpose model to simplify large-scale computation
Allows users to focus on the problem without worrying about the details
References
Original paper: http://labs.google.com/papers/mapreduce.html
Wikipedia: http://en.wikipedia.org/wiki/MapReduce
Hadoop (MapReduce in Java): http://lucene.apache.org/hadoop/
MapReduce tutorial: http://code.google.com/edu/parallel/mapreduce-tutorial.html