map reduce -...
TRANSCRIPT
![Page 1: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/1.jpg)
Map Reduce
1 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
![Page 2: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/2.jpg)
2 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
INPUT
PROCESS
OUTPUT
Typical application
![Page 3: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/3.jpg)
3 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
INPUT
PROCESS
OUTPUT
What if…
![Page 4: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/4.jpg)
4 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
INPUT
OUTPUT
Divide & Conquer
I1
W1
O1
I2
W2
O2
I3
W3
O3
I4
W4
O4
I5
W5
O5
![Page 5: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/5.jpg)
Questions
• How do we split the input?
• How do we distribute the input splits?
• How do we collect the output splits?
• How do we aggregate the output?
• How do we coordinate the work?
• What if input splits > num workers?
• What if workers need to share input/output splits?
• What if a worker dies?
• What if we have a new input?
5 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
![Page 6: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/6.jpg)
6 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
![Page 7: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/7.jpg)
Design ideas
• Scale “out”, not “up” – Low end machines
• Move processing to the data – Network bandwidth bottleneck
• Process data sequentially, avoid random access – Huge data files – Write once, read many
• Seamless scalability – Strive for the unobtainable
• Right level of abstraction – Hide implementation details from applications
development
7 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
![Page 8: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/8.jpg)
Typical Large-Data Problem
• Iterate over a large number of records • Extract something of interest from each • Shuffle and sort intermediate results • Aggregate intermediate results • Generate final output
(Dean and Ghemawat, OSDI 2004)
Map Reduce Programming Model
8 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
![Page 9: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/9.jpg)
Intermediate Values
Final Value Initial Value
Intermediate list
Input list
Fold
Map
g g g g g
f f f f f
From functional programming…
9 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
![Page 10: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/10.jpg)
…To MapReduce • Programmers specify two functions:
map (k1, v1) → [(k2, v2)]
reduce (k2, [v2]) → [(k3, v3)]
• All values with the same key are sent to the same reducer
• Input keys and values (k1, v1) are drawn from different domain than output keys and values (k3, v3)
• Intermediate keys (k2, v2) and values are from the same domain as the output keys and values (k3, v3)
• The runtime handles everything else…
10 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
![Page 11: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/11.jpg)
Programming Model (simple)
INPUT
OUTPUT
O1 O2 O3
I1
map
I2
map
I3
map
I4
map
I5
map
Aggregate values by key
reduce reduce reduce
11 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
![Page 12: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/12.jpg)
Example (I)
12 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
![Page 13: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/13.jpg)
Example (II)
13 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
Input Hello world Java and MapReduce Hello Java
split 1 Hello world
split 2 Java and
MapReduce
split 3 Hello Java
map 3
(hello, 1)
(java, 1)
map 2
(and, 1)
(mapreduce, 1)
(java, 1)
reduce 1
(hello, 1)
(and, 1)
(hello, 1)
reduce 2
(world, 1)
(mapreduce, 1)
(java, 1)
(java, 1)
map 1
(hello, 1)
(world, 1)
Output (hello, 2) (world, 1) (and, 1) (java, 2) (mapreduce, 1)
![Page 14: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/14.jpg)
Runtime
• Handles scheduling – Assigns workers to map and reduce tasks
• Handles “data distribution” – Moves processes to data
• Handles synchronization – Gathers, sorts, and shuffles intermediate data
• Handles errors and faults – Detects worker failures and restarts
• Everything happens on top of a distributed FS
14 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
![Page 15: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/15.jpg)
Partitioners and combiners
• Programmers specify two functions: map (k1, v1) → [(k2, v2)] reduce (k2, [v2]) → [(k3, v3)] – All values with the same key are reduced together
• The execution framework handles everything else…
• Not quite…usually, programmers also specify:
partition (k2, number of partitions) → partition for k2 – Often a simple hash of the key, e.g., hash(k’) mod n – Divides up key space for parallel reduce operations
combine (k2, v2) → [(k2, v2)] – Mini-reducers that run in memory after the map phase – Used as an optimization to reduce network traffic
15 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
![Page 16: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/16.jpg)
Programming Model (complete)
INPUT
I1
map
I2
map
I3
map
I4
map
I5
map
Aggregate values by key
OUTPUT
O1 O2 O3
reduce reduce reduce
16 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
partition
combine
partition
combine
partition
combine
partition
combine
partition
combine
![Page 17: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/17.jpg)
MapReduce can refer to…
• The programming model • The execution framework (aka “runtime”) • The specific implementation
17 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms
![Page 18: Map Reduce - didawiki.di.unipi.itdidawiki.di.unipi.it/.../magistraleinformaticanetworking/cpa/mapreduc… · Map Reduce Programming Model MCSN – N. Tonellotto – Complements of](https://reader035.vdocument.in/reader035/viewer/2022071213/602edf4a46831937237a8740/html5/thumbnails/18.jpg)
MapReduce Implementations
• Google has a proprietary implementation in C++ – Bindings in Java, Python
• Hadoop is an open-source implementation in Java – Development led by Yahoo, used in production – Now an Apache project – Rapidly expanding software ecosystem
• Lots of custom research implementations – For GPUs, cell processors, etc.
18 MCSN – N. Tonellotto – Complements of Distributed Enabling Platforms