Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Presentation by Zbigniew Chlebicki, based on the paper by Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica; University of California, Berkeley.

Some images and code samples are taken from the paper, the NSDI presentation, or the Spark Project website (http://spark-project.org/).

MapReduce in Hadoop

Resilient Distributed Datasets (RDD)

● Immutable, partitioned collection of records

● Created by deterministic coarse-grained transformations

● Materialized on action

● Fault-tolerant through lineage

● Controllable persistence and partitioning
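The properties above can be illustrated with a minimal single-machine sketch in plain Scala. It is not Spark code: the names `LazyDataset`, `Source`, `Mapped` and `Filtered` are hypothetical and exist only for this illustration. Each dataset is immutable and records how it was derived from its parent; nothing is computed until an action (`collect` or `count`) is called, and a lost result can always be rebuilt by replaying the recorded lineage.

```scala
// Illustrative sketch only; none of these names are part of Spark.
object LineageSketch {
  // An immutable dataset that remembers how it was derived (its lineage).
  sealed trait LazyDataset[T] {
    // Action: walks the lineage and materializes the records.
    def collect(): Vector[T]
    def count(): Int = collect().size
    // Coarse-grained transformations: they only record a new lineage node.
    def map[U](f: T => U): LazyDataset[U] = Mapped(this, f)
    def filter(p: T => Boolean): LazyDataset[T] = Filtered(this, p)
  }

  // Leaf of the lineage graph: knows how to (re)load the base data.
  final case class Source[T](load: () => Vector[T]) extends LazyDataset[T] {
    def collect(): Vector[T] = load()
  }

  final case class Mapped[A, B](parent: LazyDataset[A], f: A => B)
      extends LazyDataset[B] {
    def collect(): Vector[B] = parent.collect().map(f)
  }

  final case class Filtered[T](parent: LazyDataset[T], p: T => Boolean)
      extends LazyDataset[T] {
    def collect(): Vector[T] = parent.collect().filter(p)
  }
}
```

Building `Source(...).filter(...).map(...)` does not touch the data; each call to `collect()` recomputes the result from the source, which is the same mechanism that lets a lost partition be rebuilt. Persistence and partitioning are deliberately left out of this sketch.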

Example: Log mining

val file = spark.textFile("hdfs://…")
val errors = file.filter(
  line => line.contains("ERROR")
).cache()

// Count all the errors
errors.count()

// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()

// Fetch the MySQL errors as an array of strings
errors.filter(line => line.contains("MySQL")).collect()

Example: Logistic Regression

val points = spark.textFile(…).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
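For readers without a cluster at hand, the same loop can be sketched on ordinary Scala collections. This is a local approximation, not the Spark program: `Point`, `dot`, `pointGradient` and `train` are hypothetical names, vectors are plain `Array[Double]`, labels are assumed to be -1.0 or +1.0, and the step size is fixed at 1 as on the slide.

```scala
import scala.math.exp
import scala.util.Random

// Local, single-machine sketch; names and types are illustrative assumptions.
object LogisticRegressionSketch {
  // A labeled point: features x and label y, assumed to be -1.0 or +1.0.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Per-point gradient term from the slide:
  // (1 / (1 + exp(-y * (w . x))) - 1) * y * x
  def pointGradient(w: Array[Double], p: Point): Array[Double] = {
    val scale = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
    p.x.map(_ * scale)
  }

  def train(points: Seq[Point], dims: Int, iterations: Int, seed: Long): Array[Double] = {
    val rng = new Random(seed)
    var w = Array.fill(dims)(rng.nextDouble()) // random initial plane
    for (_ <- 1 to iterations) {
      // map each point to its gradient term, then sum the terms (the reduce step)
      val gradient = points
        .map(p => pointGradient(w, p))
        .reduce((a, b) => a.indices.map(i => a(i) + b(i)).toArray)
      w = w.indices.map(i => w(i) - gradient(i)).toArray
    }
    w
  }
}
```

In the Spark version the `points` collection is cached in memory across the cluster, so every iteration after the first reads it from RAM instead of reloading it from storage; the local sketch only mirrors the arithmetic.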

Example: PageRank

val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    // each page sends an equal share of its rank to every neighbor
    case (url, (neighbors, rank)) =>
      neighbors.map(dest => (dest, rank / neighbors.size))
  }.reduceByKey(_ + _)
}
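The iteration can be tried out locally with plain Scala maps standing in for the two RDDs. This is a sketch under stated assumptions: `PageRankSketch` and `pageRank` are hypothetical names, every page starts with rank 1.0 (the slide does not show the initial ranks), and, as on the slide, no damping factor is applied.

```scala
// Local sketch on Scala collections; not the Spark API.
object PageRankSketch {
  // links: url -> outgoing neighbors. Returns url -> rank after the iterations.
  def pageRank(links: Map[String, Seq[String]], iterations: Int): Map[String, Double] = {
    // Assumption: every page with outgoing links starts at rank 1.0.
    var ranks: Map[String, Double] = links.keys.map(url => url -> 1.0).toMap
    for (_ <- 1 to iterations) {
      // join + flatMap: one contribution per outgoing link
      val contribs: Seq[(String, Double)] = for {
        (url, neighbors) <- links.toSeq
        rank <- ranks.get(url).toList // inner join: skip urls without a rank
        dest <- neighbors
      } yield (dest, rank / neighbors.size)
      // reduceByKey(_ + _): sum the contributions per destination url
      ranks = contribs.groupBy(_._1).map { case (url, cs) => url -> cs.map(_._2).sum }
    }
    ranks
  }
}
```

For the graph a -> (b, c), b -> c, c -> a, one iteration gives a = 1.0, b = 0.5 and c = 1.5: page a splits its rank between two neighbors, while b and c each pass their full rank on.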

Representation

abstract def compute(split: Split): Iterator[T]
abstract val dependencies: List[spark.Dependency[_]]
abstract def splits: Array[Split]
val partitioner: Option[Partitioner]
def preferredLocations(split: Split): Seq[String]
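To make the interface above concrete, here is a rough single-machine sketch of how two dataset kinds might implement it. It is an illustration, not Spark's source: `MiniRDD`, `ParallelCollection` and `MappedRDD` are hypothetical names for this sketch, dependencies are simplified to a list of parent datasets, and the `partitioner` member is omitted.

```scala
// Hypothetical, simplified stand-ins for the interface on the slide.
object RepresentationSketch {
  final case class Split(index: Int)

  trait MiniRDD[T] {
    def splits: Array[Split]                                // partitions of the dataset
    def dependencies: List[MiniRDD[_]]                      // parent datasets (lineage)
    def compute(split: Split): Iterator[T]                  // records of one partition
    def preferredLocations(split: Split): Seq[String] = Nil // placement hints, none here
    // Action: compute every partition and concatenate the results.
    def collect(): Vector[T] = splits.toVector.flatMap(s => compute(s))
  }

  // Leaf dataset: the data is already divided into partitions.
  final class ParallelCollection[T](partitions: Vector[Vector[T]]) extends MiniRDD[T] {
    def splits: Array[Split] = partitions.indices.map(i => Split(i)).toArray
    def dependencies: List[MiniRDD[_]] = Nil
    def compute(split: Split): Iterator[T] = partitions(split.index).iterator
  }

  // Derived dataset: each output partition depends on exactly one parent partition.
  final class MappedRDD[A, B](parent: MiniRDD[A], f: A => B) extends MiniRDD[B] {
    def splits: Array[Split] = parent.splits
    def dependencies: List[MiniRDD[_]] = List(parent)
    def compute(split: Split): Iterator[B] = parent.compute(split).map(f)
  }
}
```

Because `MappedRDD.compute` pulls from the parent's iterator, a single lost partition can be recomputed by calling `compute` on that one split, without touching the others.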

Scheduling

Evaluation: PageRank

Scalability

Fault Recovery (k-means)

Behavior with Insufficient RAM (logistic regression)

User Applications

● Conviva, data mining (40x speedup)

● Mobile Millennium, traffic modeling

● Twitter, spam classification

● ...

Expressing other Models

● MapReduce, DryadLINQ

● Pregel graph processing

● Iterative MapReduce

● SQL
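As one example of expressing another model, MapReduce maps naturally onto a flatMap followed by a per-key reduction. The sketch below shows the idea on local Scala collections rather than RDDs; `mapReduce` and `wordCount` are hypothetical helper names introduced for this illustration.

```scala
// Local sketch: MapReduce built from flatMap, groupBy and a per-key reduce.
object MapReduceSketch {
  // flatMap plays the role of the map phase, groupBy the shuffle,
  // and the per-key reduce the reduce phase.
  def mapReduce[In, K, V](input: Seq[In])(mapper: In => Seq[(K, V)])(
      reducer: (V, V) => V): Map[K, V] =
    input
      .flatMap(mapper)
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2).reduce(reducer) }

  // Classic word count written against the helper above.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    mapReduce(lines)(line =>
      line.split("\\s+").filter(_.nonEmpty).toSeq.map(word => (word, 1)))(_ + _)
}
```

On RDDs the same shape would be written with `flatMap` and `reduceByKey`, the operations already used in the PageRank slide.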

Conclusion

● RDDs are an efficient, general, and fault-tolerant abstraction for cluster computing

● Up to 20x faster than Hadoop for memory-bound applications

● Can be used for interactive data mining

● Available as open source at http://spark-project.org