Problem-solving on large-scale clusters: theory and applications
Lecture 3: Bringing it all together
Today’s Outline
• Course directions, projects, and feedback
• Quiz 2
• Context / Where we are
  – Why do we care about fold() and map()?
  – Why do we care about parallelization and data dependencies?
• MapReduce architecture from 10,000 feet
Context and Review
• Data dependencies determine whether a problem can be formulated in MapReduce
• The properties of fold() and map() determine how to formulate a problem in MapReduce
How do you parallelize fold()? map()?
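The question above can be made concrete with a small sketch. The names `parMap` and `parFold` below are illustrative, not a real parallel runtime: a map over disjoint chunks needs no coordination at all, while a fold can be split across chunks only when the combining operator is associative and the seed is its identity.

```haskell
import Data.List (foldl')

-- map: each chunk can be processed independently, with no coordination.
parMap :: (a -> b) -> [[a]] -> [b]
parMap f chunks = concatMap (map f) chunks

-- fold: splitting only works when `op` is associative and `z` is its
-- identity; fold each chunk locally, then fold the partial results.
parFold :: (b -> b -> b) -> b -> [[b]] -> b
parFold op z chunks = foldl' op z (map (foldl' op z) chunks)
```

For example, `parFold (+) 0 [[1,2],[3,4]]` sums each chunk on its own "machine" and then combines the partial sums, giving 10 either way.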
MapReduce Introduction
• MapReduce is both a programming model and a clustered computing system
  – A specific way of formulating a problem, which yields good parallelizability
  – A system which takes a MapReduce-formulated problem and executes it on a large cluster
    • Hides implementation details, such as hardware failures, grouping and sorting, scheduling …
• Previous lectures have focused on MapReduce-the-problem-formulation
• Today will mostly focus on MapReduce-the-system
MR Problem Formulation: Formal Definition
MapReduce:
mapreduce fm fr l =
  map (reducePerKey fr) (group (map fm l))
reducePerKey fr (k,v_list) =
  (k, (foldl (fr k) [] v_list))

– Assume map here is actually concatMap.
– Argument l is a list of documents
– The result of the first map is a list of key-value pairs
– The function fr takes 3 arguments: key, context, current. With currying, this allows for locking the value of “key” for each list during the fold.
MapReduce maps a fold over the sorted result of a map!
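The definition above becomes runnable Haskell once `group` is filled in. This sketch adds an explicit initial accumulator `z` (which the slide fixes to `[]`), and uses sorting to stand in for the shuffle a real cluster performs:

```haskell
import Data.Function (on)
import Data.List (foldl', groupBy, sortOn)

-- A runnable sketch of the slide's definition; `z` is the
-- initial accumulator for each per-key fold.
mapReduce :: Ord k => (a -> [(k, v)]) -> (k -> r -> v -> r) -> r -> [a] -> [(k, r)]
mapReduce fm fr z l = map (reducePerKey fr z) (group (concatMap fm l))

reducePerKey :: (k -> r -> v -> r) -> r -> (k, [v]) -> (k, r)
reducePerKey fr z (k, vs) = (k, foldl' (fr k) z vs)

-- Collect intermediate pairs by key; sorting plays the role of
-- the cluster's shuffle/sort phase.
group :: Ord k => [(k, v)] -> [(k, [v])]
group kvs = [ (fst (head g), map snd g)
            | g <- groupBy ((==) `on` fst) (sortOn fst kvs) ]
```

With this, word count is `mapReduce (\d -> [(w, 1) | w <- words d]) (\_ acc v -> acc + v) 0`, and running it on `["a b a", "b"]` yields `[("a",2),("b",2)]`.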
MR System Overview (1 of 2)

Map:
– Preprocesses a set of files to generate intermediate key-value pairs
– As parallelized as you want

Group:
– Partitions intermediate key-value pairs by unique key, generating a list of all associated values

Reduce:
– For each key, iterates over value list
– Performs computation that requires context between iterations
– Parallelizable amongst different keys, but not within one key
MR System Overview (2 of 2)
Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
Example: MapReduce DocInfo (1 of 2)
MapReduce:
mapreduce fm fr l =
  map (reducePerKey fr) (group (map fm l))
reducePerKey fr (k,v_list) =
  (k, (foldl (fr k) [] v_list))

Pseudocode for fm:
fm contents = concat [
  [(“spaces”, (count_spaces contents))],
  (map (emit “raw”) (split contents)),
  (map (emit “scrub”) (scrub (split contents)))]

emit label value = (label, (value, 1))
Example: MapReduce DocInfo (2 of 2)
MapReduce:
mapreduce fm fr l =
  map (reducePerKey fr) (group (map fm l))
reducePerKey fr (k,v_list) =
  (k, (foldl (fr k) [] v_list))

Pseudocode for fr:
fr ‘spaces’ count (total:xs) = (total+count : xs)
fr ‘raw’ (word,count) result = (update_result (word,count) result)
fr ‘scrub’ (word,count) result = (update_result (word,count) result)
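As a concrete rendering of the two DocInfo slides, the sketch below picks simple stand-ins for the unspecified helpers (`countSpaces`, `scrub`, and an association-list `bump` in place of `update_result`), uses a sum type `Val` because `fm` emits values of two different shapes, and puts `fr`'s arguments in the order `foldl (fr k)` actually requires (key, accumulator, value):

```haskell
import Data.Char (isAlpha, toLower)

-- Values carried through DocInfo: a space count or a tokenized word.
data Val = Spaces Int | Word String deriving (Eq, Show)

-- Assumed helpers; the slides leave count_spaces and scrub unspecified.
countSpaces :: String -> Int
countSpaces = length . filter (== ' ')

scrub :: String -> String
scrub = map toLower . filter isAlpha

-- fm: one stream of (key, value) pairs per document, as in the slide.
fm :: String -> [(String, Val)]
fm contents =
     ("spaces", Spaces (countSpaces contents))
   : [ ("raw",   Word w)         | w <- words contents ]
  ++ [ ("scrub", Word (scrub w)) | w <- words contents ]

-- Per-key accumulator: a running total for "spaces",
-- an association list of word counts for "raw"/"scrub".
data Acc = Total Int | Counts [(String, Int)] deriving (Eq, Show)

-- fr key acc value, matching foldl (fr k).
fr :: String -> Acc -> Val -> Acc
fr "spaces" (Total t)   (Spaces n) = Total (t + n)
fr _        (Counts cs) (Word w)   = Counts (bump w cs)
fr _        acc         _          = acc   -- ignore mismatched shapes

-- Stand-in for the slide's update_result.
bump :: String -> [(String, Int)] -> [(String, Int)]
bump w [] = [(w, 1)]
bump w ((w', n) : rest)
  | w == w'   = (w', n + 1) : rest
  | otherwise = (w', n) : bump w rest
```

For example, folding `fr "raw"` with a `Counts []` seed over the values `[Word "a", Word "b", Word "a"]` yields `Counts [("a",2),("b",1)]`.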
Group Exercise

Formulate the following as MapReduces:
1. Find the set of unique words in a document
   a) Input: a bunch of words
   b) Output: all the unique words (no repeats)
2. Calculate per-employee taxes
   a) Input: a list of (employee, salary, month) tuples
   b) Output: a list of (employee, taxes due) pairs
3. Randomly reorder sentences
   a) Input: a bunch of documents
   b) Output: all sentences in random order (may include duplicates)
4. Compute the minesweeper grid/map
   a) Input: coordinates for the location of mines
   b) Output: coordinate/value pairs for all non-zero cells

Can you think of generalized techniques for decomposing problems?
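As one hedged sketch of exercise 1: the map phase emits a (word, ()) pair per token, the sort/group phase collapses equal keys, and each per-key reduce simply re-emits its key, so duplicates disappear for free.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Exercise 1 as a MapReduce pipeline, with sorting standing in
-- for the cluster's shuffle.
uniqueWords :: [String] -> [String]
uniqueWords docs =
    map (fst . head)                            -- reduce: output the key only
  . groupBy ((==) `on` fst)                     -- group by key
  . sortOn fst                                  -- shuffle/sort
  $ [ (w, ()) | doc <- docs, w <- words doc ]   -- map: emit (word, ())
```

Running `uniqueWords ["a b a", "c b"]` gives `["a","b","c"]`; the same shape (emit key, reduce to key) recurs in the two-pass DocInfo later in the lecture.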
MapReduce Parallelization: Execution
MapReduce Parallelization: Pipelining

• Finely granular tasks: many more map tasks than machines
– Better dynamic load balancing
– Minimizes time for fault recovery
– Can pipeline the shuffling/grouping while maps are still running
• Example: 2000 machines -> 200,000 map + 5000 reduce tasks
Example: MR DocInfo, revisited

Do MapReduce DocInfo in 2 passes (instead of 1), performing all the work in the “group” step

Map1:
1. Tokenize document
2. For each token, output:
   a) (“raw:<word>”, 1)
   b) (“scrubbed:<scrubbed_word>”, 1)

Reduce1:
1. For each key, ignore the value list and output (key, 1)

Map2:
1. Tokenize document
2. For each token “type:value”, output (type, 1)

Reduce2:
1. For each key, output (key, (sum values))
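The two-pass pipeline above can be sketched as follows. `groupKeys` stands in for the framework's group/shuffle step, and for brevity the scrubbing is treated as the identity, so both token types see the same words; the net effect is counting *distinct* tokens per type:

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Stand-in for the framework's group/shuffle step.
groupKeys :: Ord k => [(k, v)] -> [(k, [v])]
groupKeys = map (\g -> (fst (head g), map snd g))
          . groupBy ((==) `on` fst) . sortOn fst

-- Pass 1: emit one "type:token" key per token, then collapse
-- duplicates by ignoring each key's value list.
pass1 :: [String] -> [(String, Int)]
pass1 docs = [ (k, 1) | (k, _) <- groupKeys
                 [ (t ++ ":" ++ w, 1 :: Int)
                 | doc <- docs, w <- words doc, t <- ["raw", "scrub"] ] ]

-- Pass 2: keep only the type prefix of each key, then sum per type.
pass2 :: [(String, Int)] -> [(String, Int)]
pass2 kvs = [ (t, sum vs) | (t, vs) <- groupKeys
                [ (takeWhile (/= ':') k, v) | (k, v) <- kvs ] ]
```

For instance, `pass2 (pass1 ["a b a"])` yields `[("raw",2),("scrub",2)]`: the duplicate “a” was eliminated inside pass 1's group step.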
Example: MR DocInfo, revisited
• Of the 2 DocInfo MapReduce implementations, which is better?
• Define “better”. What resources are you considering? Dev time? CPU? Network? Disk? Complexity? Reusability?
(Diagram: three Mappers, two Reducers, and GFS.)
Key:
• Connections are network links
• GFS is a cluster of storage machines
Hadoop-as-MapReduce

mapreduce fm fr l =
  map (reducePerKey fr) (group (map fm l))
reducePerKey fr (k,v_list) =
  (k, (foldl (fr k) [] v_list))

Hadoop:
• fm and fr are function objects (classes)
• The class for fm implements the Mapper interface:
  map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter)
• The class for fr implements the Reducer interface:
  reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter)

Hadoop takes the generated class files and manages running them
Bonus Materials: MR Runtime
• The following slides illustrate an example run of MapReduce on a Google cluster
• A sample job from the indexing pipeline that processes ~900 GB of crawled pages
MR Runtime (1 of 9) through (9 of 9)

Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html