
Page 1:

Spark: Resilient Distributed Datasets for In-Memory Cluster Computing

Brad Karp, UCL Computer Science

(with slides contributed by Matei Zaharia)

CS M038 / GZ06, 9th March 2016


Page 3:

Motivation

MapReduce greatly simplified "big data" analysis on large, unreliable clusters.

But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g., iterative machine learning and graph processing)
» More interactive ad-hoc queries

Response: specialized frameworks for some of these apps (e.g., Pregel for graph processing)


Page 5:

Motivation

Complex apps and interactive queries both need one thing that MapReduce lacks:

Efficient primitives for data sharing

In MapReduce, the only way to share data across jobs is stable storage → slow!


Page 7:

Examples

[Figure: two data-sharing patterns in MapReduce. Top: an iterative job, where each iteration reads its input from HDFS and writes its output back to HDFS for the next iteration (HDFS read, iter. 1, HDFS write, HDFS read, iter. 2, HDFS write, ...). Bottom: interactive use, where each of query 1, 2, 3, ... re-reads the same input from HDFS to produce its result.]

Slow because of replication and disk I/O, but necessary for fault tolerance

Page 8:

Goal: In-Memory Data Sharing

[Figure: the same two patterns with data shared in memory. Top: iterations 1, 2, ... pass data through RAM instead of HDFS. Bottom: one-time processing loads the input into memory, and queries 1, 2, 3, ... run against it there.]

RAM is 10-100× faster than network/disk, but how to get fault tolerance?

Page 9:

Challenge

How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Page 10:

Challenge

Existing storage abstractions have interfaces based on fine-grained updates to mutable state:
» RAMCloud, databases, distributed shared memory, Piccolo

These require replicating data or logs across nodes for fault tolerance:
» Costly for data-intensive apps
» 10-100× slower than a memory write

Page 11:

Solution: Resilient Distributed Datasets (RDDs)

Restricted form of distributed shared memory:
» Immutable, partitioned collections of records
» Can only be built through coarse-grained deterministic transformations (map, filter, join, ...)

Efficient fault recovery using lineage:
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails

Pages 12-19:

RDD Recovery

[Figure, built up across slides 12-19: the in-memory sharing picture again (one-time processing feeding queries 1, 2, 3, ..., and iterations 1, 2, ... over an input), now with the intermediate data held in RDDs so that lost partitions can be recovered.]

Page 20:

Generality of RDDs

Despite their restrictions, RDDs can express surprisingly many parallel algorithms:
» These naturally apply the same operation to many items

RDDs unify many current programming models:
» Data flow models: MapReduce, Dryad, SQL, ...
» Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, ...

And they support new apps that these models don't.

Pages 21-25:

Tradeoff Space

[Figure, built up across slides 21-25: granularity of updates (fine to coarse) vs. write throughput (low to high). Fine-grained updates with low throughput: K-V stores, databases, RAMCloud; best for transactional workloads, with write throughput bounded by network bandwidth. Coarse-grained updates with high throughput: HDFS and RDDs; best for batch workloads, with write throughput bounded by memory bandwidth.]


Page 27:

Outline

» Spark programming interface
» Implementation
» Demo
» How people are using Spark

Page 28:

Spark Programming Interface

DryadLINQ-like API in the Scala language, usable interactively from the Scala interpreter.

Provides:
» Resilient distributed datasets (RDDs)
» Operations on RDDs: transformations (build new RDDs) and actions (compute and output results)
» Control of each RDD's partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)

Each category is sketched below.
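As a concrete illustration, here is a minimal sketch of all three categories against the modern Apache Spark API (the SparkContext setup and all data are assumptions added for illustration, not from the slides):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object InterfaceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))

    val nums = sc.parallelize(1 to 1000)           // base RDD
    val squares = nums.map(n => n * n)             // transformation: lazily builds a new RDD
    val evens = squares.filter(_ % 2 == 0)         // another transformation; nothing has run yet

    evens.persist(StorageLevel.MEMORY_ONLY)        // persistence control: keep in RAM once computed

    val byLastDigit = evens.map(n => (n % 10, n))
      .partitionBy(new HashPartitioner(8))         // partitioning control: layout across nodes

    println(evens.count())                         // action: triggers the actual computation
    println(byLastDigit.countByKey())              // another action

    sc.stop()
  }
}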


Pages 29-49:

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.persist()

messages.filter(_.contains("foo")).count       // action
messages.filter(_.contains("bar")).count

[Figure, built up across these slides: the master hands tasks to three workers; each worker reads one block of the input (Block 1, 2, 3) from HDFS, caches its partition of messages (Msgs. 1, 2, 3) in RAM, and returns results to the master.]

Result: full-text search of Wikipedia in <1 sec (vs. 20 sec for on-disk data)

Result: scaled to 1 TB of data in 5-7 sec (vs. 170 sec for on-disk data)
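For completeness, a runnable version of this example against the modern Apache Spark API might look as follows; the log path, the SparkContext setup, and the assumption that the message text is the third tab-separated field are illustrative, not from the slides:

import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("log-mining").setMaster("local[*]"))

    val lines = sc.textFile("hdfs://namenode/logs/app.log")   // hypothetical path
    val errors = lines.filter(_.startsWith("ERROR"))          // transformed RDD
    val messages = errors.map(_.split('\t')(2))               // keep just the message field
    messages.persist()                                        // cache in RAM after first use

    // The first action materializes and caches `messages`; later actions reuse the cached data.
    println(messages.filter(_.contains("foo")).count())
    println(messages.filter(_.contains("bar")).count())

    sc.stop()
  }
}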


Page 51:

Fault Recovery

RDDs track the graph of transformations that built them (their lineage) and use it to rebuild lost data. E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

[Figure: the lineage chain HadoopRDD (path = hdfs://...) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(...)).]
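In the modern Apache Spark API this lineage can be inspected directly with RDD.toDebugString; a minimal runnable sketch, using a small in-memory collection in place of the HDFS file (the sample data is made up, and the RDD class names printed today differ from the HadoopRDD/FilteredRDD/MappedRDD names on the slide):

import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[*]"))

    val messages = sc.parallelize(Seq("error\tdb\ttimeout", "info\tweb\tok"))
      .filter(_.contains("error"))
      .map(_.split('\t')(2))

    // Prints the chain of RDDs (the lineage) that Spark would replay to
    // recompute any lost partition of `messages`.
    println(messages.toDebugString)

    sc.stop()
  }
}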

Page 52:

Fault Recovery Results

[Figure: iteration time (s) for iterations 1-10 of an iterative job: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59. A failure happens during iteration 6; lineage-based recomputation raises that iteration's time only modestly, and subsequent iterations return to normal.]

Page 53:

Fault Tolerance vs. Performance

With RDDs, the programmer controls the tradeoff between fault tolerance and performance:

» Persist frequently (e.g., with REPLICATE): fast recovery, but slower execution
» Persist infrequently: fast execution, but slow recovery

This knob is sketched below.
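In the modern Apache Spark API this knob is the StorageLevel passed to persist(); the _2 levels replicate each partition on two nodes. A minimal sketch (the RDD and the function name are assumptions):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// `expensive` stands for some RDD that is costly to recompute from lineage.
def choosePersistence(expensive: RDD[String], favorFastRecovery: Boolean): RDD[String] =
  if (favorFastRecovery)
    // Replicate each in-memory partition on 2 nodes: recovery after a failure
    // is fast, but execution pays the replication cost.
    expensive.persist(StorageLevel.MEMORY_ONLY_2)
  else
    // Keep a single in-memory copy: execution is fast, but a lost partition
    // must be recomputed from lineage.
    expensive.persist(StorageLevel.MEMORY_ONLY)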

Page 54:

Why Wide vs. Narrow Dependencies?

An RDD includes metadata:

» lineage (the graph of RDDs and operations that built it)
» wide dependency: requires a shuffle (and possibly the entire preceding RDD(s) in the lineage)
» narrow dependency: no shuffle (needs only isolated partition(s) of the preceding RDD(s) in the lineage)

The distinction is sketched below.
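As an illustration against the modern Apache Spark API: map yields a narrow (one-to-one) dependency, while reduceByKey yields a wide (shuffle) dependency, visible through each RDD's dependencies field. A sketch with made-up data:

import org.apache.spark.{SparkConf, SparkContext}

object DependencySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("deps").setMaster("local[*]"))

    val words = sc.parallelize(Seq("a", "b", "a", "c"))
    val pairs = words.map(w => (w, 1))       // narrow: each output partition depends
                                             // on exactly one input partition
    val counts = pairs.reduceByKey(_ + _)    // wide: requires a shuffle across all
                                             // partitions of `pairs`

    println(pairs.dependencies)              // expect a OneToOneDependency
    println(counts.dependencies)             // expect a ShuffleDependency

    sc.stop()
  }
}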

Page 55:

Example: PageRank

1. Start each page with a rank of 1
2. On each iteration, update each page's rank to

   rank_p = Σ_{i ∈ neighbors(p)} rank_i / |neighbors_i|

   (each page i that links to p contributes its own rank divided by its out-degree |neighbors_i|)

links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}
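A self-contained, runnable version of this loop might look as follows; the toy link graph, the SparkContext setup, and the variable names beyond the slide's are assumptions added for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pagerank").setMaster("local[*]"))
    val ITERATIONS = 10

    // Toy link graph: (url, pages it links to).
    val links = sc.parallelize(Seq(
      ("a", Seq("b", "c")),
      ("b", Seq("c")),
      ("c", Seq("a"))
    )).cache()

    // 1. Start each page with a rank of 1.
    var ranks = links.mapValues(_ => 1.0)

    // 2. Each page divides its rank among its neighbors; a page's new rank
    //    is the sum of the contributions it receives.
    for (i <- 1 to ITERATIONS) {
      ranks = links.join(ranks).flatMap {
        case (_, (neighbors, rank)) =>
          neighbors.map(dest => (dest, rank / neighbors.size))
      }.reduceByKey(_ + _)
    }

    ranks.collect().foreach { case (url, rank) => println(s"$url: $rank") }
    sc.stop()
  }
}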

Page 56:

Optimizing Placement

links and ranks are repeatedly joined.

Can co-partition them (e.g., hash both on URL) to avoid shuffles.

Can also use app knowledge, e.g., hash on DNS name:

links = links.partitionBy(new URLPartitioner())

[Figure: the PageRank dataflow. Links (url, neighbors) and Ranks_0 (url, rank) feed a join producing contributions (Contribs_0), which a reduce turns into Ranks_1; Links and Ranks_1 are joined to produce the next contributions (labeled Contribs_2 on the slide), reduced into Ranks_2, and so on.]

A sketch of such a partitioner follows below.
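URLPartitioner is not a built-in class; here is a sketch of how such a partitioner might be written against Spark's Partitioner interface, hashing on hostname as the slide suggests (the URL parsing and partition count are assumptions):

import java.net.URI
import org.apache.spark.Partitioner

// Hypothetical partitioner: all URLs from the same host land in the same
// partition, so pages likely to link to each other stay co-located.
class URLPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val host = new URI(key.toString).getHost   // assumes keys are well-formed URLs
    val h = if (host == null) 0 else host.hashCode
    ((h % numPartitions) + numPartitions) % numPartitions
  }
  override def equals(other: Any): Boolean = other match {
    case p: URLPartitioner => p.numPartitions == numPartitions
    case _ => false
  }
  override def hashCode: Int = numPartitions
}

// Usage (hypothetical): links = links.partitionBy(new URLPartitioner(64))

Overriding equals matters: Spark can skip the shuffle on a join only when it can see that both RDDs already share an equal partitioner.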

Page 57:

PageRank Performance

[Figure: time per iteration (s): Hadoop 171, Basic Spark 72, Spark + Controlled Partitioning 23.]

Page 58:

Memory Exhaustion

Least-recently-used (LRU) eviction of partitions:

» First evict non-persisted partitions that have already been used; they simply vanish, as they are not stored on disk
» Then evict persisted ones

Page 59:

Behavior with Insufficient RAM

[Figure: iteration time (s) vs. percent of working set in memory: 0%: 68.8, 25%: 58.1, 50%: 40.7, 75%: 29.7, 100%: 11.5.]

Page 60:

Scalability

[Figure: iteration time (s) on 25/50/100 machines. Logistic Regression: Hadoop 184/111/76, HadoopBinMem 116/80/62, Spark 15/6/3. K-Means: Hadoop 274/157/106, HadoopBinMem 197/121/87, Spark 143/61/33.]

Page 61:

Contributors to Speedup

[Figure: iteration time (s) by input format. In-mem HDFS: 15.4 text / 8.4 binary; in-mem local file: 13.1 text / 6.9 binary; Spark RDD: 2.9 text / 2.9 binary.]

Page 62:

Implementation

Runs on Mesos [NSDI '11] to share the cluster with Hadoop.

Can read from any Hadoop input source (HDFS, S3, ...).

[Figure: Spark, Hadoop, and MPI running side by side on Mesos across the cluster's nodes.]

No changes to the Scala language or compiler:

» Reflection + bytecode analysis to correctly ship code

www.spark-project.org

Page 63:

Programming Models Implemented on Spark

RDDs can express many existing parallel models:

» MapReduce, DryadLINQ
» Pregel graph processing [200 LOC]
» Iterative MapReduce [200 LOC]
» SQL: Hive on Spark (Shark)

All are based on coarse-grained operations.

This enables apps to efficiently intermix these models.

Page 64:

RDDs: Summary

RDDs offer a simple and efficient programming model for a broad range of applications:

» Avoid disk I/O overhead for intermediate results of multi-phase computations
» Leverage lineage: cheaper than checkpointing when there are no failures, and low-cost when failures do occur
» Leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery

Spark's scheduler is significantly more complicated than MapReduce's; we did not discuss it!