Introduction to MapReduce
DESCRIPTION
Introduction to MapReduce. Amit K Singh. Do you recognize this? “The density of transistors on a chip doubles every 18 months, for the same cost” (1965). The Free Lunch Is Almost Over!! - PowerPoint PPT Presentation

TRANSCRIPT
![Page 1: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/1.jpg)
Introduction to MapReduce
Amit K Singh
![Page 2: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/2.jpg)
“The density of transistors on a chip doubles every 18 months, for the same cost” (1965)

Do you recognize this??
![Page 3: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/3.jpg)
“The density of transistors on a chip doubles every 18 months, for the same cost” — Gordon Moore (1965)
![Page 4: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/4.jpg)
The Free Lunch Is Almost Over!!
![Page 5: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/5.jpg)
The Future is Multi-core!!

Web graphic: Super Computer (Janet E. Ward, 2000)
Cluster of Desktops
![Page 6: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/6.jpg)
The Future is Multi-core!!
Replace specialized, powerful supercomputers with large clusters of commodity hardware.

But distributed programming is inherently complex.
![Page 7: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/7.jpg)
Google’s MapReduce Paradigm
Platform for reliable, scalable parallel computing.

Abstracts the issues of the distributed and parallel environment away from the programmer.

Runs over the Google File System (GFS).
![Page 8: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/8.jpg)
Detour: Google File System (GFS)
Highly scalable distributed file system for large data-intensive applications.
Provides redundant storage of massive amounts of data on cheap and unreliable computers
Provides a platform over which other systems, such as MapReduce and BigTable, operate.
![Page 9: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/9.jpg)
GFS Architecture
![Page 10: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/10.jpg)
MapReduce: Insight
“Consider the problem of counting the number of occurrences of each word in a large collection of documents.”

How would you do it in parallel?
![Page 11: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/11.jpg)
One possible solution
![Page 12: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/12.jpg)
MapReduce Programming Model

Inspired by the map and reduce operations commonly used in functional programming languages like Lisp.

Users implement an interface of two primary methods:◦1. Map: (key1, val1) → (key2, val2)◦2. Reduce: (key2, [val2]) → [val3]
![Page 13: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/13.jpg)
Map operation

Map, a pure function written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs. ◦e.g. (doc-id, doc-content)

Drawing an analogy to SQL, map can be visualized as the group-by clause of an aggregate query.
![Page 14: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/14.jpg)
Reduce operation

On completion of the map phase, all the intermediate values for a given output key are combined into a list and given to a reducer.

Can be visualized as an aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.
![Page 15: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/15.jpg)
Pseudo-code

```
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
```
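The pseudo-code above can be simulated in ordinary Python. This is a single-process sketch, not Google's actual API: `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative names, and the "shuffle" step is just a dictionary grouping intermediate pairs by key.

```python
from collections import defaultdict

def map_fn(doc_id, contents):
    # Emit an intermediate (word, "1") pair for every word occurrence.
    for word in contents.split():
        yield (word, "1")

def reduce_fn(word, counts):
    # Sum the partial counts for one word.
    return str(sum(int(c) for c in counts))

def run_mapreduce(docs):
    # Shuffle phase: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for doc_id, contents in docs.items():
        for key, value in map_fn(doc_id, contents):
            groups[key].append(value)
    return {word: reduce_fn(word, counts) for word, counts in groups.items()}

docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
print(run_mapreduce(docs))  # {'the': '2', 'quick': '1', 'brown': '1', ...}
```

In the real system the grouping is done by the framework across machines; only the two user functions are written by the programmer.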
![Page 16: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/16.jpg)
MapReduce: Execution overview
![Page 17: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/17.jpg)
MapReduce: Execution overview
![Page 18: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/18.jpg)
MapReduce: Example
![Page 19: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/19.jpg)
MapReduce in Parallel: Example
![Page 20: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/20.jpg)
MapReduce: Runtime Environment
![Page 21: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/21.jpg)
MapReduce: Fault Tolerance

Handled via re-execution of tasks.

Task completion is committed through the master.

What happens if a Mapper fails? ◦Re-execute completed + in-progress map tasks

What happens if a Reducer fails? ◦Re-execute in-progress reduce tasks

What happens if the Master fails? ◦Potential trouble!!
![Page 22: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/22.jpg)
MapReduce Refinements: Locality Optimization
Leverage GFS to schedule a map task on a machine that contains a replica of the corresponding input data.

Thousands of machines read input at local disk speed.

Without this, rack switches limit the read rate.
![Page 23: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/23.jpg)
MapReduce Refinements: Redundant Execution

Slow workers are a source of bottlenecks and may delay completion time.

Near the end of a phase, spawn backup copies of the remaining tasks; the first to finish wins.

Effectively utilizes spare computing power, significantly reducing job completion time.
![Page 24: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/24.jpg)
MapReduce Refinements: Skipping Bad Records

Map/Reduce functions sometimes fail for particular inputs.

Fixing the bug might not be possible: third-party libraries.

On error: ◦Worker sends a signal to the Master ◦If multiple errors occur on the same record, skip the record
![Page 25: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/25.jpg)
MapReduce Refinements: Miscellaneous

Combiner function at the Mapper

Sorting guarantees within each reduce partition

Local execution for debugging/testing

User-defined counters
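The combiner refinement can be sketched as follows. This is an illustrative single-machine example (the function name and structure are mine, not the framework's): the combiner runs a local partial reduce on each mapper's output, so the shuffle sends one (word, partial-count) pair per distinct word instead of one pair per word occurrence.

```python
from collections import Counter

def map_with_combiner(doc_contents):
    # Without a combiner, word count emits one ("w", "1") pair per
    # occurrence. The combiner pre-sums locally on the mapper, cutting
    # the data shipped over the network to one pair per distinct word.
    local = Counter(doc_contents.split())
    return [(word, count) for word, count in local.items()]

pairs = map_with_combiner("to be or not to be")
print(pairs)  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```

This works because the reduce function here (summation) is associative and commutative; that is the condition under which a combiner is safe to apply.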
![Page 26: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/26.jpg)
MapReduce: Walk-through of One More Application
![Page 27: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/27.jpg)
![Page 28: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/28.jpg)
MapReduce: PageRank
PageRank models the behavior of a “random surfer”.

C(t) is the out-degree of t, and (1-d) is a damping factor (random jump).

The “random surfer” keeps clicking on successive links at random, not taking content into consideration.

A page distributes its PageRank equally among all pages it links to.

The damping factor models the surfer “getting bored” and typing an arbitrary URL.
PR(x) = (1 − d) + d · Σ_{i=1..n} PR(t_i) / C(t_i)
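A quick numeric check of the formula, using made-up values: suppose d = 0.85 and page x has two in-linking pages t1 (PR = 0.5, out-degree 2) and t2 (PR = 0.3, out-degree 3).

```python
d = 0.85  # common choice of damping factor; illustrative
# (PR(t_i), C(t_i)) pairs for the pages linking to x; values made up
inlinks = [(0.5, 2), (0.3, 3)]

# PR(x) = (1 - d) + d * sum of PR(t_i) / C(t_i)
pr_x = (1 - d) + d * sum(pr / c for pr, c in inlinks)
print(round(pr_x, 4))  # 0.4475
```

That is 0.15 + 0.85 · (0.25 + 0.10) = 0.4475: each in-link contributes its rank divided by its out-degree, and the (1 − d) term is the random-jump floor.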
![Page 29: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/29.jpg)
Computing PageRank
![Page 30: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/30.jpg)
PageRank: Key Insights

Effects at each iteration are local: the (i+1)-th iteration depends only on the i-th iteration.

At iteration i, PageRank for individual nodes can be computed independently.
![Page 31: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/31.jpg)
PageRank using MapReduce

Use a sparse matrix representation (M).

Map each row of M to a list of PageRank “credit” to assign to out-link neighbours.

These prestige scores are reduced to a single PageRank value for a page by aggregating over them.
![Page 32: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/32.jpg)
PageRank using MapReduce

Map: distribute PageRank “credit” to link targets.

Reduce: gather up PageRank “credit” from multiple sources to compute the new PageRank value.

Iterate until convergence.
Source of Image: Lin 2008
![Page 33: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/33.jpg)
Phase 1: Process HTML

The Map task takes (URL, page-content) pairs and maps them to (URL, (PRinit, list-of-urls))◦PRinit is the “seed” PageRank for URL◦list-of-urls contains all pages pointed to by URL

The Reduce task is just the identity function.
![Page 34: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/34.jpg)
Phase 2: PageRank Distribution

The Reduce task gets (URL, url_list) and many (URL, val) values◦Sum the vals and fix up with d to get the new PR◦Emit (URL, (new_rank, url_list))

Check for convergence using a non-parallel component.
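One iteration of this phase can be sketched in Python. This is a single-process illustration under assumed conventions (function names `pagerank_map`, `pagerank_reduce`, and the ("links", ...) / ("credit", ...) tagging are mine): map emits the graph structure alongside rank credit, and reduce sums the credit and applies the damping factor.

```python
from collections import defaultdict

D = 0.85  # damping factor

def pagerank_map(url, state):
    # state = (current_rank, outlinks). Re-emit the link structure so
    # it survives the iteration, then distribute rank credit equally.
    rank, outlinks = state
    yield (url, ("links", outlinks))
    for target in outlinks:
        yield (target, ("credit", rank / len(outlinks)))

def pagerank_reduce(url, values):
    # Recover the outlink list and sum incoming credit, then apply
    # the PageRank formula to get the new rank.
    outlinks, total = [], 0.0
    for tag, v in values:
        if tag == "links":
            outlinks = v
        else:
            total += v
    return (url, ((1 - D) + D * total, outlinks))

def iterate(graph):
    groups = defaultdict(list)
    for url, state in graph.items():
        for key, value in pagerank_map(url, state):
            groups[key].append(value)
    return dict(pagerank_reduce(url, vals) for url, vals in groups.items())

# Tiny two-page graph where the pages link to each other; seed rank 1.0.
graph = {"a": (1.0, ["b"]), "b": (1.0, ["a"])}
graph = iterate(graph)
print(graph["a"][0])  # 1.0 (the symmetric graph is already at its fixed point)
```

The non-parallel convergence check would compare ranks between successive iterations and stop when the change falls below a threshold.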
![Page 35: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/35.jpg)
MapReduce: Some More Apps

Distributed grep

Count of URL access frequency

Clustering (k-means)

Graph algorithms

Indexing systems
MapReduce Programs In Google Source Tree
![Page 36: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/36.jpg)
MapReduce: Extensions and Similar Apps

Pig (Yahoo!)

Hadoop (Apache)

DryadLINQ (Microsoft)
![Page 37: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/37.jpg)
Large-Scale Systems Architecture using MapReduce
![Page 38: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/38.jpg)
Take-Home Messages

Although restrictive, MapReduce provides a good fit for many problems encountered in the practice of processing large data sets.

The functional programming paradigm can be applied to large-scale computation.

Easy to use: it hides the messy details of parallelization, fault tolerance, data distribution and load balancing from programmers.

And finally, if it works for Google, it should be handy!!
![Page 39: Introduction to MapReduce](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d12550346895da6d11f/html5/thumbnails/39.jpg)
Thank You