resilient distributed datasets: a fault-tolerant …ey204/teaching/acs/r212_2015_2016/...resilient...

21
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Christopher Little M. Zaharia et al. (2012)

Upload: others

Post on 05-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Christopher Little

M. Zaharia et al. (2012)

Page 2: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

OutlineContext

Resilient Distributed Datasets

Spark

Evaluation

Page 3: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Context.

Page 4: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Input Data

On Disk

Tuples

On Disk

Tuples

On Disk

Tuples

On Disk

Output Data

On Disk

MapReduce

Page 5: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Input Data

On Disk

Tuples

On Disk

Tuples

On Disk

Tuples

On Disk

Output Data

On Disk

MapReduce

Google Pregel

Page 6: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Dryad

Page 7: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Distributed Shared Memory

Page 8: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Resilient Distributed Datasets (RDD).

Page 9: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Structure of RDDs

Partition 1

Partition 2

Partition 3

Page 10: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Structure of RDDs

Partition 1

Partition 2

Partition 3

Partition 1

Partition 2

Partition 3

Stable Storage

Page 11: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Structure of RDDs

Partition 1

Partition 2

Partition 3

Partition 1

Partition 2

Partition 3

map/filter

Page 12: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Structure of RDDs

Partition 1

Partition 2

Partition 1

Partition 2

Partition 3

groupByKey

Page 13: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Spark.

Page 14: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Spark example

lines = spark.textFile( "hdfs://..." )errors = lines.filter( _.startsWith( "ERROR" ) )

errors.persist()

// Count errors mentioning MySQL:errors.filter( _.contains( "MySQL" ) ) .count() // Nothing computed until now

// Return the time fields of errors mentioning// HDFS as an array (assuming time is field// number 3 in a tab-separated format):errors.filter( _.contains( "HDFS" ) ) .map( _.split( '\t' )( 3 ) ) .collect( )

Page 15: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Scheduling

Page 16: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Evaluation.

Page 17: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

20xSpark claims to be up to 20x faster than Hadoop in iterative applications

Page 18: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Figure 8: Running times for iterations after the first in Hadoop, HadoopBinMem, and Spark. The jobs all processed 100 GB.

Page 19: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Figure 12(logistic regression using 100 GB

on 25 nodes)

Figure 8(logistic regression using 100 GB)

Page 20: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Page 21: Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Spark promises a lot, but the evidence presented here is insufficient.(but it could live up to their claims)