MapReduce & Spark Adway – Groupe SQUARE December 10, 2015 Dario Colazzo Professeur des Universités

Page 1: MapReduce & Spark - Paris Dauphine Universitycolazzo/BD/partie2.1.pdf · • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia,

MapReduce & Spark
Adway – Groupe SQUARE

December 10, 2015

Dario Colazzo, Professeur des Universités

Source: Google

MapReduce

Typical Big Data Problem

• Iterate over a large number of records
• Extract something of interest from each (Map)
• Shuffle and sort intermediate results
• Aggregate intermediate results (Reduce)
• Generate final output

Key idea: provide a functional abstraction for these two operations, Map and Reduce.

(Dean and Ghemawat, OSDI 2004)
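The five steps can be sketched on a single machine in plain Python. This is only an illustration: the record format ("year,temperature") and the function name `extract` are assumptions made up for the example, not something the slides define.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input records: "year,temperature" strings.
records = ["1949,111", "1950,22", "1949,78", "1950,-11", "1951,5"]

def extract(record):
    """Map step: pull a (key, value) pair out of one record."""
    year, temperature = record.split(",")
    return year, int(temperature)

# Steps 1-2: iterate over the records and extract something from each.
intermediate = [extract(record) for record in records]

# Step 3: shuffle and sort, so that equal keys become adjacent.
intermediate.sort(key=itemgetter(0))

# Step 4: aggregate the values of each key (Reduce step), here with max().
aggregated = {
    year: max(temperature for _, temperature in group)
    for year, group in groupby(intermediate, key=itemgetter(0))
}

# Step 5: generate the final output.
for year, hottest in aggregated.items():
    print(year, hottest)
```

Only `extract` and the aggregation expression are specific to the problem; the iteration, sorting, and grouping are the parts a MapReduce framework would take over.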

MapReduce

• Programmers specify two functions:
    map (k1, v1) → [<k2, v2>]
    reduce (k2, [v2]) → [<k3, v3>]
    • All values with the same key are sent to the same reducer
• The execution framework handles everything else…

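[Figure: MapReduce dataflow. Input pairs (k1, v1) … (k6, v6) go to four map tasks, which emit (a, 1), (b, 2) | (c, 3), (c, 6) | (a, 5), (c, 2) | (b, 7), (c, 8). "Shuffle and Sort: aggregate values by keys" groups these into a → [1, 5], b → [2, 7], c → [2, 3, 6, 8], and three reduce tasks produce the outputs (r1, s1), (r2, s2), (r3, s3).]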
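The grouping performed by the shuffle-and-sort stage in this figure can be reproduced with a few lines of Python. The sketch below reads the figure as four mappers emitting (a, 1), (b, 2), then (c, 3), (c, 6), then (a, 5), (c, 2), then (b, 7), (c, 8). Sorting the values of each key is an assumption made only to match the order drawn in the figure; the function name `shuffle_and_sort` is likewise illustrative.

```python
from collections import defaultdict

# Output of the four map tasks, as read from the figure.
mapper_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 8)],
]

def shuffle_and_sort(outputs):
    """Group all values by key and return the keys in sorted order."""
    groups = defaultdict(list)
    for output in outputs:
        for key, value in output:
            groups[key].append(value)
    # Values are sorted here only to mirror the figure.
    return {key: sorted(groups[key]) for key in sorted(groups)}

grouped = shuffle_and_sort(mapper_outputs)
for key, values in grouped.items():
    print(key, values)
```

Each entry of `grouped` corresponds to the input of one reduce call.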

Word Count: Baseline

What’s the impact of combiners?
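The pseudocode shown on this slide did not survive in the transcript. A baseline word count usually has the mapper emit (term, 1) for every term and the reducer sum the counts; the sketch below assumes that form. The driver `run_job` is a hypothetical single-process stand-in for the execution framework, and it also counts how many intermediate pairs the mappers emit.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Emit (term, 1) for every term occurrence in the document."""
    for term in text.split():
        yield term, 1

def reduce_fn(term, counts):
    """Sum all partial counts for one term."""
    yield term, sum(counts)

def run_job(documents, mapper, reducer):
    """Hypothetical in-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    emitted = 0
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            groups[key].append(value)
            emitted += 1
    result = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            result[out_key] = out_value
    return result, emitted

documents = {"d1": "a rose is a rose", "d2": "a rose"}
counts, emitted = run_job(documents, map_fn, reduce_fn)
print(counts, emitted)
```

In this baseline every term occurrence becomes one intermediate pair (seven pairs for seven words), and that volume is what a combiner is meant to shrink.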

MapReduce

• Programmers specify two functions:
    map (k, v) → <k’, v’>*
    reduce (k’, v’) → <k’, v’>*
    • All values with the same key are sent to the same reducer
• The execution framework handles everything else…

What’s “everything else”?

MapReduce “Runtime”

• Handles scheduling
    • Assigns workers to map and reduce tasks
• Handles “data distribution”
    • Moves processes to data
• Handles synchronization
    • Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
    • Detects worker failures and restarts
• Everything happens on top of a distributed filesystem

MapReduce

• Programmers specify two functions:
    map (k, v) → <k’, v’>*
    reduce (k’, v’) → <k’, v’>*
    • All values with the same key are reduced together
• The execution framework handles everything else…
• Not quite… usually, programmers also specify:
    partition (k’, number of partitions) → partition for k’
    • Often a simple hash of the key, e.g., hash(k’) mod n
    • Divides up key space for parallel reduce operations
    combine (k’, v’) → <k’, v’>*
    • Mini-reducers that run in memory after the map phase
    • Used as an optimization to reduce network traffic
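A hash partitioner of the form hash(k’) mod n can be sketched as follows. The use of `zlib.crc32` is an assumption made for the example: Python's built-in `hash` is salted per process for strings, so a stable checksum is used instead to keep the assignment reproducible across runs. The names `partition` and `buckets` are illustrative.

```python
import zlib
from collections import defaultdict

def partition(key, num_partitions):
    """Assign a key to a reducer: a stable hash of the key, mod n."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Route some intermediate pairs to three reducers.
pairs = [("a", 1), ("b", 2), ("c", 3), ("a", 5), ("c", 2)]
buckets = defaultdict(list)
for key, value in pairs:
    buckets[partition(key, 3)].append((key, value))

for reducer_id in sorted(buckets):
    print(reducer_id, buckets[reducer_id])
```

Because the partition depends only on the key, every pair with the same key lands in the same bucket, which is what lets each reducer work independently of the others.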

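[Figure: MapReduce dataflow with combiners and partitioners. Input pairs (k1, v1) … (k6, v6) go to four map tasks, which emit (a, 1), (b, 2) | (c, 3), (c, 6) | (a, 5), (c, 2) | (b, 7), (c, 8). Each map task is followed by a combine step and a partition step; the second combiner merges (c, 3) and (c, 6) into (c, 9). "Shuffle and Sort: aggregate values by keys" then yields a → [1, 5], b → [2, 7], c → [2, 9, 8] (instead of c → [2, 3, 6, 8] without combiners), and three reduce tasks produce the outputs (r1, s1), (r2, s2), (r3, s3).]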
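The effect of the combiners in this figure can be checked with a small sketch. It assumes the combiner sums values per key, which is consistent with (c, 3) and (c, 6) becoming (c, 9) in the figure; the mapper outputs are read from the figure and the function name `combine` is illustrative.

```python
from collections import defaultdict

def combine(pairs):
    """Mini-reducer: sum the values per key within one mapper's output."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Output of the four map tasks, as read from the figure.
mapper_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 8)],
]

# Run the combiner locally on each mapper's output.
combined_outputs = [combine(output) for output in mapper_outputs]

# Group what is left by key, as the shuffle would.
groups = defaultdict(list)
for output in combined_outputs:
    for key, value in output:
        groups[key].append(value)

pairs_without_combiner = sum(len(output) for output in mapper_outputs)
pairs_with_combiner = sum(len(output) for output in combined_outputs)
print(dict(groups), pairs_without_combiner, pairs_with_combiner)
```

Only the second mapper has a repeated key, so the saving here is a single pair; on real text, where frequent terms repeat many times per input split, the reduction is far larger.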

Two more details…

• Barrier between map and reduce phases
    • But intermediate data can be copied over as soon as mappers finish
• Keys arrive at each reducer in sorted order
    • No enforced ordering across reducers

What’s the big deal?

• Developers need the right level of abstraction
    • Moving beyond the von Neumann architecture
    • We need better programming models
• Abstractions hide low-level details from the developers
    • No more race conditions, lock contention, etc.
• MapReduce separates the what from the how
    • Developer specifies the computation that needs to be performed
    • Execution framework (“runtime”) handles actual execution

Source: Google

The datacenter is the computer!

MapReduce Implementations

• Google has a proprietary implementation in C++
    • Bindings in Java, Python
• Hadoop is an open-source implementation in Java
    • Development led by Yahoo, now an Apache project
    • Used in production at Yahoo, Facebook, Twitter, LinkedIn, Netflix, …
    • The de facto big data processing platform
    • Rapidly expanding software ecosystem
• Lots of custom research implementations
    • For GPUs, Cell processors, etc.

MapReduce algorithm design

• The execution framework handles “everything else”…
    • Scheduling: assigns workers to map and reduce tasks
    • “Data distribution”: moves processes to data
    • Synchronization: gathers, sorts, and shuffles intermediate data
    • Errors and faults: detects worker failures and restarts
• Limited control over data and execution flow
    • All algorithms must be expressed in m, r, c, p (map, reduce, combine, partition)
• You don’t know:
    • Where mappers and reducers run
    • When a mapper or reducer begins or finishes
    • Which input a particular mapper is processing
    • Which intermediate key a particular reducer is processing

Implementation Details

Source: www.flickr.com/photos/8773361@N05/2524173778/

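HDFS Architecture

[Figure: HDFS architecture, adapted from (Ghemawat et al., SOSP 2003). An application talks to the HDFS client. The client sends (file name, block id) to the HDFS namenode and receives (block id, block location). The namenode holds the file namespace (e.g. /foo/bar → block 3df2), sends instructions to the datanodes, and receives datanode state. The client then sends (block id, byte range) to an HDFS datanode and receives block data. Each HDFS datanode stores its blocks on the local Linux file system.]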
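The read path in this diagram (metadata from the namenode, block data from a datanode) can be sketched with a toy in-memory model. All class and method names below are hypothetical and are not the Hadoop API; for simplicity the client asks the namenode for the n-th block of a file by index, and replication is reduced to a list of datanode names.

```python
class NameNode:
    """Holds metadata only: the file namespace and where each block lives."""

    def __init__(self):
        self.namespace = {}   # file name -> ordered list of block ids
        self.locations = {}   # block id -> names of datanodes holding it

    def add_block(self, file_name, block_id, datanode_names):
        self.namespace.setdefault(file_name, []).append(block_id)
        self.locations[block_id] = list(datanode_names)

    def lookup(self, file_name, block_index):
        """Answer a client request with (block id, block locations)."""
        block_id = self.namespace[file_name][block_index]
        return block_id, self.locations[block_id]


class DataNode:
    """Stores block contents; a dict stands in for the local file system."""

    def __init__(self):
        self.blocks = {}

    def read(self, block_id, start, end):
        """Serve a (block id, byte range) request with block data."""
        return self.blocks[block_id][start:end]


class Client:
    def __init__(self, namenode, datanodes):
        self.namenode = namenode
        self.datanodes = datanodes

    def read(self, file_name, block_index, start, end):
        # 1. Metadata request to the namenode.
        block_id, holders = self.namenode.lookup(file_name, block_index)
        # 2. Data request straight to one datanode that holds the block.
        return self.datanodes[holders[0]].read(block_id, start, end)


namenode = NameNode()
datanodes = {"dn1": DataNode(), "dn2": DataNode()}
for name in ("dn1", "dn2"):
    datanodes[name].blocks["3df2"] = b"hello hdfs"
namenode.add_block("/foo/bar", "3df2", ["dn1", "dn2"])

client = Client(namenode, datanodes)
data = client.read("/foo/bar", 0, 0, 5)
print(data)
```

The point of the split is that block data never flows through the namenode: it only answers small metadata requests, while the bulk transfer happens between client and datanode.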

Putting everything together…

[Figure: cluster layout. The namenode machine runs the namenode daemon; the job submission node runs the jobtracker; each of the three slave nodes runs a tasktracker and a datanode daemon on top of the Linux file system.]

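Shuffle and Sort

[Figure: shuffle and sort. The Mapper writes its output to a circular buffer (in memory); the buffer is written out as spills (on disk), which are combined into merged spills (on disk). The Combiner appears twice along this spill and merge path. The Reducer collects intermediate files (on disk) from this mapper and from other mappers, while the mapper's output also feeds other reducers.]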
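The mapper-side path in this figure (buffer, spills, merged spills, with a combiner along the way) can be approximated as below. This is a simplification under stated assumptions: a plain list stands in for the circular buffer, in-memory lists stand in for on-disk spill files, and the names `run_mapper_side`, `sum_combiner`, and `buffer_capacity` are made up for the example.

```python
import heapq

def sum_combiner(sorted_pairs):
    """Collapse adjacent pairs with the same key by summing their values."""
    out = []
    for key, value in sorted_pairs:
        if out and out[-1][0] == key:
            out[-1] = (key, out[-1][1] + value)
        else:
            out.append((key, value))
    return out

def run_mapper_side(pairs, buffer_capacity, combiner=None):
    """Buffer map output, spill sorted runs when full, then merge the runs."""
    spills = []   # each spill is one sorted run (a file on disk in reality)
    buffer = []   # stands in for the in-memory circular buffer

    def spill():
        buffer.sort()
        spills.append(combiner(buffer) if combiner else list(buffer))
        buffer.clear()

    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= buffer_capacity:
            spill()
    if buffer:
        spill()

    # Merge the sorted runs into one sorted stream of pairs.
    merged = list(heapq.merge(*spills))
    return combiner(merged) if combiner else merged

map_output = [("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1), ("a", 1)]
plain = run_mapper_side(map_output, buffer_capacity=4)
combined = run_mapper_side(map_output, buffer_capacity=4, combiner=sum_combiner)
print(plain)
print(combined)
```

Running the combiner both at spill time and at merge time illustrates why a combiner must tolerate being applied more than once to the same key.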

Local Aggregation

Source: www.flickr.com/photos/bunnieswithsharpteeth/490935152/

Importance of Local Aggregation

• Ideal scaling characteristics:
    • Twice the data, twice the running time
    • Twice the resources, half the running time
• Why can’t we achieve this?
    • Synchronization requires communication
    • Communication kills performance (network is slow!)
• Thus… avoid communication!
    • Reduce intermediate data via local aggregation
    • Combiners can help

Word Count: Baseline

What’s the impact of combiners?

Word Count: Version 1

Are combiners still needed?
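The pseudocode for this slide is missing from the transcript. In Lin and Dyer's book, the first refinement of word count aggregates counts inside a single map call using an associative array; the sketch below assumes that this is what "Version 1" refers to. The function name `map_fn` is illustrative.

```python
from collections import Counter

def map_fn(doc_id, text):
    """Count terms within one document, then emit one pair per distinct term."""
    counts = Counter(text.split())
    for term, count in sorted(counts.items()):
        yield term, count

pairs = list(map_fn("d1", "a rose is a rose"))
print(pairs)
```

This emits three pairs for the five-word document instead of five. Counts for the same term coming from different documents are still emitted separately, so there remains work that a combiner could do.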

Word Count: Version 2

Are combiners still needed?
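The pseudocode for this slide is also missing. Assuming "Version 2" is the variant that keeps counts across map calls (the pattern this deck calls in-mapper combining), a sketch could look as follows. The class name, the `close` hook, and the `flush_threshold` parameter are hypothetical; the threshold is one simple way to bound the memory held by the mapper.

```python
from collections import defaultdict

class InMapperCombiningWordCount:
    """Mapper that keeps term counts across map calls and emits them late."""

    def __init__(self, flush_threshold=100_000):
        self.counts = defaultdict(int)
        self.flush_threshold = flush_threshold  # max distinct terms in memory

    def map(self, doc_id, text):
        for term in text.split():
            self.counts[term] += 1
        # Explicit memory management: emit early if the table grows too big.
        if len(self.counts) >= self.flush_threshold:
            yield from self._flush()

    def close(self):
        """Called once after the last map call: emit whatever is left."""
        yield from self._flush()

    def _flush(self):
        for term, count in sorted(self.counts.items()):
            yield term, count
        self.counts.clear()

mapper = InMapperCombiningWordCount()
emitted = []
for doc_id, text in [("d1", "a rose is a rose"), ("d2", "a rose")]:
    emitted.extend(mapper.map(doc_id, text))
emitted.extend(mapper.close())
print(emitted)
```

For these two documents the mapper emits three pairs in total, one per distinct term, where the baseline mapper would emit seven.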

Design Pattern for Local Aggregation

• “In-mapper combining”
    • Fold the functionality of the combiner into the mapper by preserving state across multiple map calls
• Advantages
    • Speed
    • Why is this faster than actual combiners?
• Disadvantages
    • Explicit memory management required
    • Potential for order-dependent bugs

Combiner Design

• Combiners and reducers share the same method signature
    • Sometimes, reducers can serve as combiners
    • Often, not…
• Remember: combiners are optional optimizations
    • Should not affect algorithm correctness
    • May be run 0, 1, or multiple times
• Example: find average of integers associated with the same key

Computing the Mean: Version 1

Why can’t we use reducer as combiner?
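A small numeric check makes the problem concrete: the mean of partial means is generally not the overall mean. The sketch assumes a reducer that simply averages its input values, and a key whose five values are split unevenly across two mappers; the function name and the data are illustrative.

```python
def mean_reducer(key, values):
    """Reducer that averages all values seen for a key."""
    values = list(values)
    return key, sum(values) / len(values)

# Values for the same key, produced by two different mappers.
mapper_a_values = [1, 2, 3]
mapper_b_values = [4, 5]

# Correct: the reducer sees all five values.
_, correct = mean_reducer("k", mapper_a_values + mapper_b_values)

# Incorrect: the reducer is reused as a combiner on each mapper's output,
# and the final reducer then averages the two partial means.
partial_means = [
    mean_reducer("k", mapper_a_values)[1],
    mean_reducer("k", mapper_b_values)[1],
]
_, wrong = mean_reducer("k", partial_means)

print(correct, wrong)
```

The results differ (3.0 versus 3.25) because the partial means discard how many values each one summarizes.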

Computing the Mean: Version 2

Why doesn’t this work?

Computing the Mean: Version 3

Fixed?
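The slide's pseudocode is not in the transcript. One common way to make the mean safe for combiners is to have the mapper emit (sum, count) pairs, so that mapper output and combiner output share a type and the combiner can run any number of times. The sketch below assumes that approach; all function names are illustrative.

```python
def map_fn(key, value):
    """Emit the value as a partial (sum, count) pair."""
    return key, (value, 1)

def combine(key, pairs):
    """Add up partial sums and counts; output has the same type as input."""
    pairs = list(pairs)
    return key, (sum(s for s, _ in pairs), sum(c for _, c in pairs))

def reduce_fn(key, pairs):
    """Add up the partial pairs, then divide once at the very end."""
    _, (total, count) = combine(key, pairs)
    return key, total / count

mapper_a = [map_fn("k", v)[1] for v in [1, 2, 3]]
mapper_b = [map_fn("k", v)[1] for v in [4, 5]]

# Combiner run zero times: the reducer sees the raw pairs.
_, mean_no_combiner = reduce_fn("k", mapper_a + mapper_b)

# Combiner run once per mapper.
once = [combine("k", mapper_a)[1], combine("k", mapper_b)[1]]
_, mean_one_pass = reduce_fn("k", once)

# Combiner run again on its own output.
twice = [combine("k", once)[1]]
_, mean_two_passes = reduce_fn("k", twice)

print(mean_no_combiner, mean_one_pass, mean_two_passes)
```

All three runs give the same mean, which is the property required of an optional optimization that may be applied 0, 1, or multiple times.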

Credits & bibliography

• Credits:
    • Amir H. Payberah, Swedish Institute of Computer Science
    • Jimmy Lin, University of Maryland
• Bibliography:
    • Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. Morgan & Claypool Publishers, 2010.
    • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012.