TRANSCRIPT
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Christopher Little
M. Zaharia et al. (2012)
Outline
Context
Resilient Distributed Datasets
Spark
Evaluation
Context.
[Diagram: MapReduce data flow. Input data is read from disk, every stage writes its intermediate tuples back to disk before the next stage reads them, and the final output is written to disk again.]
MapReduce
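To make the disk barrier in the MapReduce diagram concrete, here is a toy sketch in plain Scala (no Hadoop; the `stage` helper is purely illustrative) in which every stage materializes its tuples to a file before the next stage can read them back:

```scala
import java.nio.file.Files
import scala.jdk.CollectionConverters._

// Toy model of the diagrammed data flow: each stage writes its tuples
// "to disk" (a temp file) and the next stage reads them back, mimicking
// the disk round trip between chained MapReduce jobs.
object DiskBoundPipeline {
  def stage(in: List[String])(f: String => String): List[String] = {
    val tmp = Files.createTempFile("tuples", ".txt") // intermediate data on disk
    Files.write(tmp, in.map(f).asJava)               // stage output written out
    Files.readAllLines(tmp).asScala.toList           // next stage reads it back
  }

  def run(input: List[String]): List[String] =
    // Three chained stages mean three full round trips through disk,
    // even when the intermediate data would easily fit in memory.
    stage(stage(stage(input)(s => (s.toInt * 2).toString))(identity))(identity)
}
```

This per-stage materialization is exactly the cost that iterative and interactive workloads pay repeatedly under MapReduce, and that RDDs avoid by keeping intermediate data in memory.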
MapReduce
Google Pregel
Dryad
Distributed Shared Memory
Resilient Distributed Datasets (RDDs).
Structure of RDDs
[Diagram: an RDD is a collection of partitions (Partition 1, Partition 2, Partition 3).]
[Diagram: an RDD whose partitions are loaded from stable storage, one input partition per RDD partition.]
[Diagram: map/filter, a narrow dependency. Each output partition is computed from a single input partition.]
[Diagram: groupByKey, a wide dependency. Each output partition may depend on all input partitions.]
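The narrow/wide distinction in the diagrams can be sketched with plain Scala collections standing in for RDD partitions (no Spark involved; names like `partitions` are illustrative, not Spark API):

```scala
// Sketch of narrow vs. wide transformations: an "RDD" here is just a
// list of partitions, each partition a list of key-value pairs.
object RddStructureSketch {
  val partitions: List[List[(String, Int)]] = List(
    List(("a", 1), ("b", 2)), // Partition 1
    List(("a", 3), ("c", 4)), // Partition 2
    List(("b", 5))            // Partition 3
  )

  // Narrow dependency (map/filter): each output partition is computed
  // from exactly one input partition, so no data crosses partitions.
  val mapped: List[List[(String, Int)]] =
    partitions.map(_.map { case (k, v) => (k, v * 10) })

  // Wide dependency (groupByKey): an output partition may need records
  // from every input partition, which is what forces a shuffle in Spark.
  val grouped: Map[String, List[Int]] =
    partitions.flatten.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
}
```

Note that `mapped` preserves the partition count, while building `grouped` has to pull the values for key "a" out of two different partitions.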
Spark.
Spark example
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count() // Nothing computed until now

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
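The "nothing computed until now" remark is the key point: transformations like filter only record lineage, and work happens only when an action such as count or collect runs. A minimal sketch of that laziness, using Scala's lazy views in place of real RDDs (illustrative only, not Spark API):

```scala
// Sketch of Spark's lazy evaluation: a view records the filter without
// running it, just as an RDD transformation only records lineage.
object LazinessSketch {
  var evaluated = 0 // counts how many lines the filter has actually touched

  val lines = List("ERROR MySQL down", "INFO ok", "ERROR HDFS slow")

  // Like an RDD transformation: nothing is computed when this is defined.
  val errors = lines.view.filter { l => evaluated += 1; l.startsWith("ERROR") }

  // Like an action: traversing the view finally forces the computation.
  def countMySql(): Int = errors.count(_.contains("MySQL"))
}
```

Before `countMySql()` is called, `evaluated` stays at 0; the action then drives the whole pipeline over the input in one pass.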
Scheduling
Evaluation.
Spark claims to be up to 20x faster than Hadoop for iterative applications.
Figure 8: Running times for iterations after the first in Hadoop, HadoopBinMem, and Spark. The jobs all processed 100 GB.
Figure 12 (logistic regression using 100 GB on 25 nodes)
Figure 8 (logistic regression using 100 GB)
Spark promises a lot, but the evidence presented here is insufficient (though it could well live up to the authors' claims).