
Page 1:

Spark: Resilient Distributed Datasets for In-Memory Cluster Computing

Brad Karp, UCL Computer Science

(with slides contributed by Matei Zaharia)

CS M038 / GZ06, 9th March 2016


Page 3:

Motivation

MapReduce greatly simplified "big data" analysis on large, unreliable clusters.

But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g., iterative machine learning and graph processing)
» More interactive ad-hoc queries

Response: specialized frameworks for some of these apps (e.g., Pregel for graph processing)


Page 5:

Motivation

Complex apps and interactive queries both need one thing that MapReduce lacks:

Efficient primitives for data sharing

In MapReduce, the only way to share data across jobs is stable storage → slow!


Page 7:

Examples

[Figure: two data-sharing patterns in MapReduce. Top: an iterative job, where each iteration reads its input from HDFS and writes its output back to HDFS for the next iteration (HDFS read, iter. 1, HDFS write, HDFS read, iter. 2, HDFS write, ...). Bottom: interactive use, where each of query 1, 2, 3, ... re-reads the same input from HDFS to produce its result.]

Slow because of replication and disk I/O, but necessary for fault tolerance

Page 8:

Goal: In-Memory Data Sharing

[Figure: the same two patterns with data shared in memory. Top: iterations 1, 2, ... pass data through RAM instead of HDFS. Bottom: one-time processing loads the input into memory, and queries 1, 2, 3, ... run against it there.]

RAM is 10-100× faster than network/disk, but how to get fault tolerance?

Page 9:

Challenge

How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Page 10:

Challenge

Existing storage abstractions have interfaces based on fine-grained updates to mutable state:
» RAMCloud, databases, distributed shared memory, Piccolo

These require replicating data or logs across nodes for fault tolerance:
» Costly for data-intensive apps
» 10-100× slower than a memory write

Page 11:

Solution: Resilient Distributed Datasets (RDDs)

Restricted form of distributed shared memory:
» Immutable, partitioned collections of records
» Can only be built through coarse-grained deterministic transformations (map, filter, join, ...)

Efficient fault recovery using lineage:
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails

Pages 12-19:

RDD Recovery

[Figure, built up across slides 12-19: the in-memory sharing picture again (one-time processing feeding queries 1, 2, 3, ..., and iterations 1, 2, ... over an input), now with the intermediate data held in RDDs so that lost partitions can be recovered.]

Page 20:

Generality of RDDs

Despite their restrictions, RDDs can express surprisingly many parallel algorithms:
» These naturally apply the same operation to many items

RDDs unify many current programming models:
» Data flow models: MapReduce, Dryad, SQL, ...
» Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, ...

And they support new apps that these models don't.

Pages 21-25:

Tradeoff Space

[Figure, built up across slides 21-25: granularity of updates (fine to coarse) vs. write throughput (low to high). Fine-grained updates with low throughput: K-V stores, databases, RAMCloud; best for transactional workloads, with write throughput bounded by network bandwidth. Coarse-grained updates with high throughput: HDFS and RDDs; best for batch workloads, with write throughput bounded by memory bandwidth.]


Page 27:

Outline

» Spark programming interface
» Implementation
» Demo
» How people are using Spark

Page 28:

Spark Programming Interface

DryadLINQ-like API in the Scala language, usable interactively from the Scala interpreter.

Provides:
» Resilient distributed datasets (RDDs)
» Operations on RDDs: transformations (build new RDDs) and actions (compute and output results)
» Control of each RDD's partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)

Each category is sketched below.
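As a concrete illustration, here is a minimal sketch of all three categories against the modern Apache Spark API (the SparkContext setup and all data are assumptions added for illustration, not from the slides):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object InterfaceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))

    val nums = sc.parallelize(1 to 1000)           // base RDD
    val squares = nums.map(n => n * n)             // transformation: lazily builds a new RDD
    val evens = squares.filter(_ % 2 == 0)         // another transformation; nothing has run yet

    evens.persist(StorageLevel.MEMORY_ONLY)        // persistence control: keep in RAM once computed

    val byLastDigit = evens.map(n => (n % 10, n))
      .partitionBy(new HashPartitioner(8))         // partitioning control: layout across nodes

    println(evens.count())                         // action: triggers the actual computation
    println(byLastDigit.countByKey())              // another action

    sc.stop()
  }
}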


Pages 29-49:

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.persist()

messages.filter(_.contains("foo")).count       // action
messages.filter(_.contains("bar")).count

[Figure, built up across these slides: the master hands tasks to three workers; each worker reads one block of the input (Block 1, 2, 3) from HDFS, caches its partition of messages (Msgs. 1, 2, 3) in RAM, and returns results to the master.]

Result: full-text search of Wikipedia in <1 sec (vs. 20 sec for on-disk data)

Result: scaled to 1 TB of data in 5-7 sec (vs. 170 sec for on-disk data)
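For completeness, a runnable version of this example against the modern Apache Spark API might look as follows; the log path, the SparkContext setup, and the assumption that the message text is the third tab-separated field are illustrative, not from the slides:

import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("log-mining").setMaster("local[*]"))

    val lines = sc.textFile("hdfs://namenode/logs/app.log")   // hypothetical path
    val errors = lines.filter(_.startsWith("ERROR"))          // transformed RDD
    val messages = errors.map(_.split('\t')(2))               // keep just the message field
    messages.persist()                                        // cache in RAM after first use

    // The first action materializes and caches `messages`; later actions reuse the cached data.
    println(messages.filter(_.contains("foo")).count())
    println(messages.filter(_.contains("bar")).count())

    sc.stop()
  }
}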


Page 51:

Fault Recovery

RDDs track the graph of transformations that built them (their lineage) and use it to rebuild lost data. E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

[Figure: the lineage chain HadoopRDD (path = hdfs://...) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(...)).]
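In the modern Apache Spark API this lineage can be inspected directly with RDD.toDebugString; a minimal runnable sketch, using a small in-memory collection in place of the HDFS file (the sample data is made up, and the RDD class names printed today differ from the HadoopRDD/FilteredRDD/MappedRDD names on the slide):

import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[*]"))

    val messages = sc.parallelize(Seq("error\tdb\ttimeout", "info\tweb\tok"))
      .filter(_.contains("error"))
      .map(_.split('\t')(2))

    // Prints the chain of RDDs (the lineage) that Spark would replay to
    // recompute any lost partition of `messages`.
    println(messages.toDebugString)

    sc.stop()
  }
}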

Page 52:

Fault Recovery Results

[Figure: iteration time (s) for iterations 1-10 of an iterative job: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59. A failure happens during iteration 6; lineage-based recomputation raises that iteration's time only modestly, and subsequent iterations return to normal.]

Page 53:

Fault Tolerance vs. Performance

With RDDs, the programmer controls the tradeoff between fault tolerance and performance:

» Persist frequently (e.g., with REPLICATE): fast recovery, but slower execution
» Persist infrequently: fast execution, but slow recovery

This knob is sketched below.
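In the modern Apache Spark API this knob is the StorageLevel passed to persist(); the _2 levels replicate each partition on two nodes. A minimal sketch (the RDD and the function name are assumptions):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// `expensive` stands for some RDD that is costly to recompute from lineage.
def choosePersistence(expensive: RDD[String], favorFastRecovery: Boolean): RDD[String] =
  if (favorFastRecovery)
    // Replicate each in-memory partition on 2 nodes: recovery after a failure
    // is fast, but execution pays the replication cost.
    expensive.persist(StorageLevel.MEMORY_ONLY_2)
  else
    // Keep a single in-memory copy: execution is fast, but a lost partition
    // must be recomputed from lineage.
    expensive.persist(StorageLevel.MEMORY_ONLY)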

Page 54:

Why Wide vs. Narrow Dependencies?

An RDD includes metadata:

» lineage (the graph of RDDs and operations that built it)
» wide dependency: requires a shuffle (and possibly the entire preceding RDD(s) in the lineage)
» narrow dependency: no shuffle (needs only isolated partition(s) of the preceding RDD(s) in the lineage)

The distinction is sketched below.
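As an illustration against the modern Apache Spark API: map yields a narrow (one-to-one) dependency, while reduceByKey yields a wide (shuffle) dependency, visible through each RDD's dependencies field. A sketch with made-up data:

import org.apache.spark.{SparkConf, SparkContext}

object DependencySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("deps").setMaster("local[*]"))

    val words = sc.parallelize(Seq("a", "b", "a", "c"))
    val pairs = words.map(w => (w, 1))       // narrow: each output partition depends
                                             // on exactly one input partition
    val counts = pairs.reduceByKey(_ + _)    // wide: requires a shuffle across all
                                             // partitions of `pairs`

    println(pairs.dependencies)              // expect a OneToOneDependency
    println(counts.dependencies)             // expect a ShuffleDependency

    sc.stop()
  }
}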

Page 55:

Example: PageRank

1. Start each page with a rank of 1
2. On each iteration, update each page's rank to

   rank_p = Σ_{i ∈ neighbors(p)} rank_i / |neighbors_i|

   (each page i that links to p contributes its own rank divided by its out-degree |neighbors_i|)

links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}
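A self-contained, runnable version of this loop might look as follows; the toy link graph, the SparkContext setup, and the variable names beyond the slide's are assumptions added for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pagerank").setMaster("local[*]"))
    val ITERATIONS = 10

    // Toy link graph: (url, pages it links to).
    val links = sc.parallelize(Seq(
      ("a", Seq("b", "c")),
      ("b", Seq("c")),
      ("c", Seq("a"))
    )).cache()

    // 1. Start each page with a rank of 1.
    var ranks = links.mapValues(_ => 1.0)

    // 2. Each page divides its rank among its neighbors; a page's new rank
    //    is the sum of the contributions it receives.
    for (i <- 1 to ITERATIONS) {
      ranks = links.join(ranks).flatMap {
        case (_, (neighbors, rank)) =>
          neighbors.map(dest => (dest, rank / neighbors.size))
      }.reduceByKey(_ + _)
    }

    ranks.collect().foreach { case (url, rank) => println(s"$url: $rank") }
    sc.stop()
  }
}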

Page 56:

Optimizing Placement

links and ranks are repeatedly joined.

Can co-partition them (e.g., hash both on URL) to avoid shuffles.

Can also use app knowledge, e.g., hash on DNS name:

links = links.partitionBy(new URLPartitioner())

[Figure: the PageRank dataflow. Links (url, neighbors) and Ranks_0 (url, rank) feed a join producing contributions (Contribs_0), which a reduce turns into Ranks_1; Links and Ranks_1 are joined to produce the next contributions (labeled Contribs_2 on the slide), reduced into Ranks_2, and so on.]

A sketch of such a partitioner follows below.
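URLPartitioner is not a built-in class; here is a sketch of how such a partitioner might be written against Spark's Partitioner interface, hashing on hostname as the slide suggests (the URL parsing and partition count are assumptions):

import java.net.URI
import org.apache.spark.Partitioner

// Hypothetical partitioner: all URLs from the same host land in the same
// partition, so pages likely to link to each other stay co-located.
class URLPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val host = new URI(key.toString).getHost   // assumes keys are well-formed URLs
    val h = if (host == null) 0 else host.hashCode
    ((h % numPartitions) + numPartitions) % numPartitions
  }
  override def equals(other: Any): Boolean = other match {
    case p: URLPartitioner => p.numPartitions == numPartitions
    case _ => false
  }
  override def hashCode: Int = numPartitions
}

// Usage (hypothetical): links = links.partitionBy(new URLPartitioner(64))

Overriding equals matters: Spark can skip the shuffle on a join only when it can see that both RDDs already share an equal partitioner.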

Page 57:

PageRank Performance

[Figure: time per iteration (s): Hadoop 171, Basic Spark 72, Spark + Controlled Partitioning 23.]

Page 58:

Memory Exhaustion

Least-recently-used (LRU) eviction of partitions:

» First evict non-persisted partitions that have already been used; they simply vanish, as they are not stored on disk
» Then evict persisted ones

Page 59:

Behavior with Insufficient RAM

[Figure: iteration time (s) vs. percent of working set in memory: 0%: 68.8, 25%: 58.1, 50%: 40.7, 75%: 29.7, 100%: 11.5.]

Page 60:

Scalability

[Figure: iteration time (s) on 25/50/100 machines. Logistic Regression: Hadoop 184/111/76, HadoopBinMem 116/80/62, Spark 15/6/3. K-Means: Hadoop 274/157/106, HadoopBinMem 197/121/87, Spark 143/61/33.]

Page 61:

Contributors to Speedup

[Figure: iteration time (s) by input format. In-mem HDFS: 15.4 text / 8.4 binary; in-mem local file: 13.1 text / 6.9 binary; Spark RDD: 2.9 text / 2.9 binary.]

Page 62:

Implementation

Runs on Mesos [NSDI '11] to share the cluster with Hadoop.

Can read from any Hadoop input source (HDFS, S3, ...).

[Figure: Spark, Hadoop, and MPI running side by side on Mesos across the cluster's nodes.]

No changes to the Scala language or compiler:

» Reflection + bytecode analysis to correctly ship code

www.spark-project.org

Page 63:

Programming Models Implemented on Spark

RDDs can express many existing parallel models:

» MapReduce, DryadLINQ
» Pregel graph processing [200 LOC]
» Iterative MapReduce [200 LOC]
» SQL: Hive on Spark (Shark)

All are based on coarse-grained operations.

This enables apps to efficiently intermix these models.

Page 64:

RDDs: Summary

RDDs offer a simple and efficient programming model for a broad range of applications:

» Avoid disk I/O overhead for intermediate results of multi-phase computations
» Leverage lineage: cheaper than checkpointing when there are no failures, and low-cost when failures do occur
» Leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery

Spark's scheduler is significantly more complicated than MapReduce's; we did not discuss it!