
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Presentation by Zbigniew Chlebicki, based on the paper by Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica; University of California, Berkeley. Some images and code samples are from the paper, the NSDI presentation, or the Spark Project website ( http://spark-project.org/ ).



Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Presentation by Zbigniew Chlebicki, based on the paper by Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica; University of California, Berkeley.

Some images and code samples are from the paper, the NSDI presentation, or the Spark Project website ( http://spark-project.org/ ).



MapReduce in Hadoop


Resilient Distributed Datasets (RDD)

● Immutable, partitioned collection of records

● Created by deterministic coarse-grained transformations

● Materialized on action

● Fault-tolerant through lineage

● Controllable persistence and partitioning
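The properties above can be illustrated with a toy, single-process model. This is a sketch only, not Spark's implementation: `ToyRDD`, `fromPartitions`, `persist`, and `dropPartition` are hypothetical names chosen for this example. Transformations merely record a recipe (the lineage), an action forces evaluation, and a lost cached partition is rebuilt by re-running the recipe on its parent.

```scala
// Toy model of lazy transformations and lineage-based recovery.
// Hypothetical names throughout; this is NOT how Spark is implemented.
object LineageSketch {
  // A dataset is described only by how to compute each of its partitions.
  final class ToyRDD[T](val numPartitions: Int, compute: Int => Vector[T]) {
    // Materialized partitions; removing an entry models losing a node.
    private val store = scala.collection.mutable.Map.empty[Int, Vector[T]]
    private var persistent = false

    // Marks the dataset to be kept in memory once it has been computed.
    def persist(): this.type = { persistent = true; this }

    // Transformations are lazy: they only build a new recipe over this dataset.
    def map[U](f: T => U): ToyRDD[U] =
      new ToyRDD[U](numPartitions, i => partition(i).map(f))

    def filter(p: T => Boolean): ToyRDD[T] =
      new ToyRDD[T](numPartitions, i => partition(i).filter(p))

    // Returns one partition, recomputing it from the parent when it is not stored.
    def partition(i: Int): Vector[T] =
      if (persistent) store.getOrElseUpdate(i, compute(i)) else compute(i)

    // Action: forces every partition to be computed.
    def count(): Int = (0 until numPartitions).map(i => partition(i).size).sum

    // Simulates the loss of one stored partition.
    def dropPartition(i: Int): Unit = store.remove(i)
  }

  // Source dataset; onRead is called each time a partition is read from "storage".
  def fromPartitions[T](data: Vector[Vector[T]], onRead: Int => Unit = _ => ()): ToyRDD[T] =
    new ToyRDD[T](data.size, i => { onRead(i); data(i) })
}
```

In this model, building `fromPartitions(...).filter(...).persist()` reads nothing from the source; the first `count()` reads every source partition once, and after `dropPartition(1)` only partition 1 is read again.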


Example: Log mining

val file = spark.textFile("hdfs://…")
val errors = file.filter(
  line => line.contains("ERROR")
).cache()

// Count all the errors
errors.count()

// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()

// Fetch the MySQL errors as an array of strings
errors.filter(line => line.contains("MySQL")).collect()


Example: Logistic Regression

val points = spark.textFile(…).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
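The expression inside `map` is the per-point gradient of the logistic loss, and `reduce(_ + _)` sums it over all points. Written out, assuming labels $y_i \in \{-1, +1\}$ (the slide does not state the label encoding) and an implicit step size of 1:

```latex
\begin{aligned}
L(w) &= \sum_{i} \log\!\left(1 + e^{-y_i \, (w \cdot x_i)}\right) \\
\nabla L(w) &= \sum_{i} \left(\frac{1}{1 + e^{-y_i \, (w \cdot x_i)}} - 1\right) y_i \, x_i \\
w &\leftarrow w - \nabla L(w)
\end{aligned}
```

Each iteration reads every point once, which is why keeping `points` cached in memory pays off across iterations.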


Example: PageRank

val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}
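To see what one iteration computes, the same update can be sketched on plain Scala collections, with `Map`s standing in for the two RDDs. This is an illustration only: `PageRankSketch`, `step`, and `run` are hypothetical names, and, like the slide, the sketch uses the simplified update with no damping factor.

```scala
// Single-machine sketch of the simplified PageRank update shown on the slide.
// Maps stand in for the links and ranks RDDs; all names are hypothetical.
object PageRankSketch {
  // One iteration: every page splits its rank evenly among its neighbors,
  // then the contributions are summed per destination page.
  def step(links: Map[String, Seq[String]], ranks: Map[String, Double]): Map[String, Double] = {
    val contributions: Seq[(String, Double)] = for {
      (url, neighbors) <- links.toSeq
      rank <- ranks.get(url).toSeq // inner join: pages without a rank contribute nothing
      dest <- neighbors
    } yield (dest, rank / neighbors.size)

    // Equivalent of reduceByKey(_ + _)
    contributions.groupBy(_._1).map { case (dest, cs) => dest -> cs.map(_._2).sum }
  }

  // Applies the update a fixed number of times.
  def run(links: Map[String, Seq[String]], initial: Map[String, Double], iterations: Int): Map[String, Double] =
    (1 to iterations).foldLeft(initial)((ranks, _) => step(links, ranks))
}
```

For the graph a → {b, c}, b → {c}, c → {a} with all ranks starting at 1.0, one step gives a = 1.0, b = 0.5, c = 1.5. A page with no inbound links drops out of the result, which matches the behavior of the join-based version.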


Representation

abstract def compute(split: Split): Iterator[T]

abstract val dependencies: List[spark.Dependency[_]]

abstract def splits: Array[Split]

val partitioner: Option[Partitioner]

def preferredLocations(split: Split): Seq[String]
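A minimal sketch of how a derived dataset might implement such an interface is given below. It is a simplified stand-in rather than Spark's own classes: `Dataset`, `SimpleSplit`, `InMemoryDataset`, and `MappedDataset` are hypothetical names, dependencies are reduced to a plain list of parents, and the partitioner is omitted.

```scala
// Simplified stand-in for the interface on the slide; hypothetical names,
// no partitioner, and dependencies modeled as a plain list of parent datasets.
object RepresentationSketch {
  trait Split { def index: Int }
  final case class SimpleSplit(index: Int) extends Split

  trait Dataset[T] {
    def splits: Array[Split]
    def dependencies: List[Dataset[_]]
    def compute(split: Split): Iterator[T]
    def preferredLocations(split: Split): Seq[String] = Nil
  }

  // Source dataset: each block is a (host name, records) pair.
  final class InMemoryDataset[T](blocks: Vector[(String, Vector[T])]) extends Dataset[T] {
    val splits: Array[Split] = blocks.indices.map(i => SimpleSplit(i): Split).toArray
    val dependencies: List[Dataset[_]] = Nil
    def compute(split: Split): Iterator[T] = blocks(split.index)._2.iterator
    override def preferredLocations(split: Split): Seq[String] = Seq(blocks(split.index)._1)
  }

  // Result of map: same splits and locations as the parent, records computed
  // lazily by wrapping the parent's iterator (no intermediate collection is built).
  final class MappedDataset[T, U](parent: Dataset[T], f: T => U) extends Dataset[U] {
    def splits: Array[Split] = parent.splits
    val dependencies: List[Dataset[_]] = List(parent)
    def compute(split: Split): Iterator[U] = parent.compute(split).map(f)
    override def preferredLocations(split: Split): Seq[String] = parent.preferredLocations(split)
  }
}
```

Because `compute` returns an iterator over the parent's iterator, a chain of such datasets is evaluated record by record within one split, and the list of dependencies is exactly the lineage needed to recompute a lost split.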


Scheduling


Evaluation: PageRank


Scalability


Fault Recovery (k-means)


Behavior with Insufficient RAM (logistic regression)


User Applications

● Conviva, data mining (40x speedup)

● Mobile Millennium, traffic modeling

● Twitter, spam classification

● ...


Expressing other Models

● MapReduce, DryadLINQ

● Pregel graph processing

● Iterative MapReduce

● SQL
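As an illustration of the first item, a MapReduce job can be written as a `flatMap` followed by a per-key grouping and reduction. The sketch below uses plain Scala collections in place of RDDs; `MapReduceSketch`, `mapReduce`, and `wordCount` are hypothetical names for this example, not part of any library.

```scala
// MapReduce expressed with collection operators; plain Scala collections
// stand in for RDDs here, and all names are hypothetical.
object MapReduceSketch {
  // mapFn emits key/value pairs per input record; reduceFn merges values per key.
  def mapReduce[A, K, V](input: Seq[A])(mapFn: A => Seq[(K, V)])(reduceFn: (V, V) => V): Map[K, V] =
    input
      .flatMap(mapFn)   // map phase
      .groupBy(_._1)    // shuffle: bring values with the same key together
      .map { case (key, pairs) => key -> pairs.map(_._2).reduce(reduceFn) } // reduce phase

  // Classic word count built on top of mapReduce.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    mapReduce(lines)(line => line.split("\\s+").filter(_.nonEmpty).toSeq.map(word => (word, 1)))(_ + _)
}
```

For example, `MapReduceSketch.wordCount(Seq("a b a", "b c"))` yields `Map("a" -> 2, "b" -> 2, "c" -> 1)`.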


Conclusion

● RDDs are an efficient, general, and fault-tolerant abstraction for cluster computing

● Up to 20x faster than Hadoop for memory-bound applications

● Can be used for interactive data mining

● Available as Open Source at http://spark-project.org

