Memory Management for Spark
Ken Salem, Cheriton School of Computer Science, University of Waterloo


Page 1

Memory Management for Spark

Ken Salem, Cheriton School of Computer Science, University of Waterloo

Page 2

Where I’m From

Page 3

What We’re Doing

● Non-Relational Database Design
● DBMS-Managed Energy Efficiency
● Flexible Transactional Persistence

Page 4

Talk Plan

● Part 1: Spark Overview
  ○ What does Spark do?
  ○ Spark system architecture
  ○ Spark programs
  ○ Program execution: sessions, jobs, stages, tasks
● Part 2: Memory and Spark
  ○ How does Spark use memory?
● Part 3: Memory-Oriented Research
  ○ External caches
  ○ Cache sharing
  ○ Cache management

Michael Mior

Page 5

Part 1: Spark Overview

Page 6

What Is Spark?

● Popular Apache platform for scalable data processing
● Analytical programs in Scala, Java, or Python, embedding Spark parallel operators
● Higher-level frameworks for ETL (Spark SQL), graph processing (GraphX), streaming (Spark Streaming), and machine learning (MLlib)
● Data in HDFS, NoSQL stores, Hive, and others

Page 7

A Spark Cluster

[Diagram: three servers, each hosting an Executor with several workers (W); a Driver; shared Cluster Storage]

● Driver: one per application. It runs the sequential part of the program and farms the parallel part out to the Executors.
● Application input/output uses a cluster-wide storage system (e.g., HDFS, Cassandra, Hive).
● Local secondary storage on each server is available for temporary/intermediate results.

Page 8

A Spark Cluster

[Diagram: the same cluster running two applications, each with its own Driver and its own Executors]

Each application has its own private driver and executors.

Page 9

Spark Executors

[Diagram: an Executor (JVM) containing several workers (W) that share a cache]

● Executor = JVM
● Workers in an executor share access to a cache.
● The application specifies (see the configuration sketch below):
  ○ the (memory) size of executors
  ○ the number of workers per executor
  ○ the number of executors
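As a concrete illustration, here is a minimal sketch of how an application might request these resources when creating its session. The sizes and counts are made-up values, and spark.executor.instances is honored when running under a resource manager such as YARN:

from pyspark.sql import SparkSession

# Hypothetical sizing: 3 executors, each a 4 GB JVM with 4 worker slots.
spark = (SparkSession.builder
         .appName("sizing-sketch")
         .config("spark.executor.memory", "4g")     # (memory) size of each executor
         .config("spark.executor.cores", "4")       # number of workers (task slots) per executor
         .config("spark.executor.instances", "3")   # number of executors
         .getOrCreate())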

Page 10

k-Means Clustering

https://elki-project.github.io/tutorial/hierarchical_clustering

Page 11

A Spark Program

# Excerpt from Spark's Python k-means example; imports and the
# parseVector/closestPoint helpers are omitted on the slide.
lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
data = lines.map(parseVector).cache()
kPoints = data.takeSample(False, K, 1)
tempDist = 1.0
while tempDist > convergeDist:
    closest = data.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(lambda p1_c1, p2_c2:
        (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
    newPoints = pointStats.map(lambda st: (st[0], st[1][0] / st[1][1])).collect()
    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)
    for (iK, p) in newPoints:
        kPoints[iK] = p
print("Final centers: " + str(kPoints))

Define an RDD of data points, from a file.

Each element of the RDD is a data point.

Page 12

RDDs

● RDD = Resilient Distributed Dataset
  ○ a set of elements on which parallel operations can be performed
● RDDs can be defined from:
  ○ collections defined in the application program
  ○ external data sources (e.g., files)
  ○ other RDDs
● Spark defines operations that can be performed on RDDs.
  ○ RDD operations are how Spark applications expose parallelism (a sketch of each definition style follows below).
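A minimal sketch of the three ways to define an RDD (the HDFS path is a hypothetical placeholder):

from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

nums = sc.parallelize([1, 2, 3, 4])             # from a collection in the application program
lines = sc.textFile("hdfs:///tmp/points.txt")   # from an external data source (placeholder path)
squares = nums.map(lambda x: x * x)             # from another RDD, via a transformation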

Page 13

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
data = lines.map(parseVector).cache()
kPoints = data.takeSample(False, K, 1)
tempDist = 1.0
while tempDist > convergeDist:
    closest = data.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(lambda p1_c1, p2_c2:
        (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
    newPoints = pointStats.map(lambda st: (st[0], st[1][0] / st[1][1])).collect()
    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)
    for (iK, p) in newPoints:
        kPoints[iK] = p
print("Final centers: " + str(kPoints))

Spark Actions

● takeSample is an example of a Spark action

● Actions generate normal program values (in this case, a Python array) from RDDs

● kPoints is the initial set of cluster centers
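A small sketch of the action boundary, assuming the sc handle from the earlier sketch: each call below crosses from the distributed RDD world back into ordinary Python values.

rdd = sc.parallelize(range(100))
n = rdd.count()                        # action: returns an int
sample = rdd.takeSample(False, 5, 1)   # action: returns a Python list of 5 elements
everything = rdd.collect()             # action: returns all elements as a Python list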

Page 14

Spark Program Graph

[Graph: Input file → lines → data (RDDs); the takeSample action reads from data]

Page 15

A Spark Program

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
data = lines.map(parseVector).cache()
kPoints = data.takeSample(False, K, 1)
tempDist = 1.0
while tempDist > convergeDist:
    closest = data.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(lambda p1_c1, p2_c2:
        (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
    newPoints = pointStats.map(lambda st: (st[0], st[1][0] / st[1][1])).collect()
    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)
    for (iK, p) in newPoints:
        kPoints[iK] = p
print("Final centers: " + str(kPoints))

map and reduceByKey are Spark transformations: they define a new RDD from an existing RDD. Transformations are evaluated lazily (see the sketch below).

● Elements of closest: (index, (point, 1))
● Elements of pointStats: (index, (pointSum, count))
● Elements of newPoints: (index, point)
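A minimal sketch of that laziness, again assuming the sc handle from above: the transformation only records lineage; the work happens at the action.

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
summed = pairs.reduceByKey(lambda x, y: x + y)   # transformation: defines a new RDD, runs nothing
result = summed.collect()                        # action: the map and shuffle actually execute now
# result == [('a', 3), ('b', 3)] (key order may vary)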

Page 16

Spark Program Graph

[Graph: lines → data → closest → pointStats → newPoints, ending in collect; takeSample still reads from data]

Page 17

A Spark Program

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
data = lines.map(parseVector).cache()
kPoints = data.takeSample(False, K, 1)
tempDist = 1.0
while tempDist > convergeDist:
    closest = data.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(lambda p1_c1, p2_c2:
        (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
    newPoints = pointStats.map(lambda st: (st[0], st[1][0] / st[1][1])).collect()
    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)
    for (iK, p) in newPoints:
        kPoints[iK] = p
print("Final centers: " + str(kPoints))

● Regular Python code to compute (squared) distance between newPoints and current cluster centers (in kPoints)

● Iteration terminates when distance is small enough

Page 18

Spark Program Graph

[Graph: each iteration of the loop adds another closest → pointStats → newPoints → collect branch off the data RDD; three iterations shown]

Page 19

Another Example Program

tc = spark.sparkContext.parallelize(generateGraph(), partitions).cache()
edges = tc.map(lambda x_y: (x_y[1], x_y[0]))
oldCount = 0
nextCount = tc.count()
while True:
    oldCount = nextCount
    # Perform the join, obtaining an RDD of (y, (z, x)) pairs,
    # then project the result to obtain the new (x, z) paths.
    new_edges = tc.join(edges).map(lambda __a_b: (__a_b[1][1], __a_b[1][0]))
    tc = tc.union(new_edges).distinct().cache()
    nextCount = tc.count()
    if nextCount == oldCount:
        break
print("TC has %i edges" % tc.count())

Callouts: the count calls are actions; map and distinct are unary transformations; join and union are binary transformations.

Page 20

Spark Jobs

[Graph: the full program graph, with one boxed region per action: takeSample and each iteration's collect]

One Spark job per action.

Page 21

Spark Jobs

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
data = lines.map(parseVector).cache()
kPoints = data.takeSample(False, K, 1)
tempDist = 1.0
while tempDist > convergeDist:
    closest = data.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(lambda p1_c1, p2_c2:
        (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
    newPoints = pointStats.map(lambda st: (st[0], st[1][0] / st[1][1])).collect()
    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)
    for (iK, p) in newPoints:
        kPoints[iK] = p
print("Final centers: " + str(kPoints))

Callouts: takeSample triggers Job 1; each collect inside the loop triggers one of Jobs 2...N.

Page 22

Spark Jobs

● One job per Spark action.

● Spark transformations are evaluated lazily.

● Jobs are created sequentially, as actions are encountered during program execution.

● Spark driver does not know in advance how many jobs a program will produce.
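A sketch of why the driver cannot count jobs in advance, assuming the sc handle from the earlier sketches: each action triggers one job, and a data-dependent loop makes the number of actions unknowable up front.

rdd = sc.parallelize(range(1000)).cache()
total = rdd.sum()    # action: triggers Job 1
top = rdd.max()      # action: triggers Job 2
while total > 10:    # data-dependent bound: the job count emerges at run time
    rdd = rdd.map(lambda x: x // 2)
    total = rdd.sum()   # each iteration's action triggers another job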

Page 23

RDD Partitioning

[Diagram: partition-level view of one k-means iteration, file → lines → data → closest → (shuffle) → pointStats → newPoints → collect. The input file is read into three partitions of data, e.g., [2,3][1,3][7,6][3,3][2,5] | [4,2][3,2][1,1][7,2][8,2][5,1] | [6,3][1,2][2,3][4,7][6,1]. closest is co-partitioned with data and holds (clusterIndex, (point, 1)) pairs. The shuffle regroups those pairs by cluster index into two partitions. pointStats holds one (index, (pointSum, count)) element per cluster: (1,([16,16],7)) and (2,([46,30],9)). newPoints holds the new centers: (1,[2.28,2.28]) and (2,[5.11,3.33]).]

Page 24

Stages and Tasks

[Diagram: the same partition-level view, annotated with one task per partition per stage. Job 1 reads the file into data and samples it. In Job 2, Stage 1 computes closest from data (one task per data partition, before the shuffle); Stage 2 computes pointStats and newPoints after the shuffle (one task per post-shuffle partition).]

Page 25

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
data = lines.map(parseVector).cache()
kPoints = data.takeSample(False, K, 1)
tempDist = 1.0
while tempDist > convergeDist:
    closest = data.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(lambda p1_c1, p2_c2:
        (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
    newPoints = pointStats.map(lambda st: (st[0], st[1][0] / st[1][1])).collect()
    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)
    for (iK, p) in newPoints:
        kPoints[iK] = p
print("Final centers: " + str(kPoints))

Job Execution and Scheduling

1. Local execution at Driver until takeSample triggers Job 1.

2. Scheduler determines Job 1 stages and tasks, and assigns tasks to Executors.

Page 26

Spark Scheduler

[Diagram: the Driver's scheduler dispatches Job 1 tasks to the three Executors, which read input from Cluster Storage]

Page 27

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
data = lines.map(parseVector).cache()
kPoints = data.takeSample(False, K, 1)
tempDist = 1.0
while tempDist > convergeDist:
    closest = data.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(lambda p1_c1, p2_c2:
        (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
    newPoints = pointStats.map(lambda st: (st[0], st[1][0] / st[1][1])).collect()
    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)
    for (iK, p) in newPoints:
        kPoints[iK] = p
print("Final centers: " + str(kPoints))

Job Execution and Scheduling

1. Local execution at Driver until collect triggers Job 2.

2. Scheduler determines Job 2 stages and tasks, and assigns tasks to Executors.

Page 28

Spark Scheduler

[Diagram: the Driver's scheduler dispatches Job 2 tasks to the three Executors]

Page 29

Part 2: Spark and Memory

Page 30

How Does Spark Use Memory?

● Memory for task execution
  ○ shuffles, joins, sorts, aggregations, ...
● Memory for caching
  ○ saved RDD partitions
● Indirect memory use, e.g., file caching

Page 31

Spark Executor Memory

[Diagram: a container holds the executor (JVM); inside the JVM, Spark memory is split into storage memory and execution memory]

● The storage/execution boundary can adjust dynamically.
● Execution can evict stored RDDs, but only down to a storage lower bound (see the configuration sketch below).
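The boundary and lower bound on this slide correspond to two standard knobs of Spark's unified memory manager; the values shown are the usual defaults, included here only as an illustrative sketch:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # fraction of (heap - ~300MB reserve) used for Spark's unified storage + execution region
         .config("spark.memory.fraction", "0.6")
         # portion of that region protected from eviction by execution: the storage lower bound
         .config("spark.memory.storageFraction", "0.5")
         .getOrCreate())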

Page 32

Executor Out-of-Memory Failures

From: M. Kunjir, S. Babu. Understanding Memory Management In Spark For Fun And Profit. Spark Summit 2016. Used with permission.

Page 33

Performance Depends on Memory

[Chart: application performance as a function of executor memory size; execution fails outright at 512MB]

Page 34

Caching in Spark

[Graph: the k-means program graph again, with data feeding takeSample and each iteration's closest → pointStats → newPoints → collect branch]

Job 1 computes data RDD, takes sample.

Job 2 must re-compute the data RDD, unless it has been cached!

Job 3 also needs the data RDD.

Page 35

What to Cache?

Spark can cache some or all of the partitions of any RDD.

Q: How does Spark know which RDDs to cache?

A: The programmer must tell it!

Page 36

Application-Managed Caching

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
data = lines.map(parseVector).cache()
kPoints = data.takeSample(False, K, 1)
tempDist = 1.0
while tempDist > convergeDist:
    closest = data.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(lambda p1_c1, p2_c2:
        (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
    newPoints = pointStats.map(lambda st: (st[0], st[1][0] / st[1][1])).collect()
    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)
    for (iK, p) in newPoints:
        kPoints[iK] = p
print("Final centers: " + str(kPoints))

Dear Spark:

Please cache this RDD.

Thanks, T. Programmer

Page 37

persist()

● Spark programmers use persist(storage_level) to identify RDDs that should be cached.
● cache() means persist at the default storage level.
● storage_level?
  ○ MEMORY_ONLY
  ○ MEMORY_ONLY_SER
  ○ DISK_ONLY
  ○ MEMORY_AND_DISK
  ○ and more...
● Also: unpersist() (see the sketch below)
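A self-contained sketch of these calls, with synthetic data standing in for a real RDD:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

data = sc.parallelize(range(10000)).persist(StorageLevel.MEMORY_AND_DISK)
print(data.count())   # first action materializes and caches the partitions
print(data.count())   # subsequent actions re-use the cached partitions
data.unpersist()      # release the cached partitions once re-use is over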

Page 38

Effect of Caching

Page 39

Spark Caching Advice

Don’t cache all RDDs.

Do cache RDDs that are re-used.

Unless there are too many of them.

In which case, just cache some of them.

Or maybe try a DISK or SER level, depending on I/O capacity, recomputation cost, and serialization effectiveness.

Page 40

Part III: Research

Topics:
● Exploiting re-use of inputs
● Cache sharing across programs
● Better cache management

Page 41

Research: Quartet

● Input skew: 80% of accesses go to 10% of the files
● Re-use happens quickly: 90% of re-use occurs within one hour
● Key idea: run tasks with cached file inputs first
  ○ Tasks in each stage are independent
  ○ Maximize “memory local” task input
● Exploits Duet: an in-kernel framework for exposing the contents of the file cache to applications

Deslauriers et al. Quartet: Harmonizing Task Scheduling and Caching for Cluster Computing. USENIX HotStorage 2016.

Page 42

In-Kernel File Cache

[Diagram: the two-application cluster; each server hosts an in-kernel file cache alongside its Executors, in front of Cluster Storage]

Page 43

Quartet Example

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
data = lines.map(parseVector).cache()
kPoints = data.takeSample(False, K, 1)
tempDist = 1.0
while tempDist > convergeDist:
    closest = data.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(lambda p1_c1, p2_c2:
        (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
    newPoints = pointStats.map(lambda st: (st[0], st[1][0] / st[1][1])).collect()
    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)
    for (iK, p) in newPoints:
        kPoints[iK] = p
print("Final centers: " + str(kPoints))

Callouts: the job that reads and parses the input file runs faster if the file is already (partially) in the file cache; the jobs inside the loop read the cached RDD and do not benefit.

Page 44

Research: Cache Sharing

● Quartet can exploit data sharing across programs, since the file system cache is external to the Spark executors.
● Other ways to achieve this:
  ○ external caches, e.g., Tachyon/Alluxio
  ○ concurrent Spark applications with shared Spark contexts

Page 45

External Caches

[Diagram: two applications' workers spread over three servers; an Alluxio cache on each server sits between the workers and Cluster Storage]

Alluxio
● Distributed, shared in-memory file system
● Caching-only, or tiered with an underlying persistent file system (see the sketch below)
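A hedged sketch of reading through Alluxio from Spark: it assumes the Alluxio client library is on Spark's classpath, and the master host and path are placeholders (19998 is Alluxio's conventional RPC port).

# Reads via the Alluxio namespace: cached blocks are served from memory,
# and misses fall through to the underlying persistent file system.
lines = spark.sparkContext.textFile("alluxio://alluxio-master:19998/data/points.txt")
print(lines.count())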

Page 46

Sharing Spark Contexts

[Diagram: a single Driver running two threads, T1 and T2, over a shared set of Executors on three servers]

● Concurrent Spark application
● Each thread can run Spark jobs
● All jobs share the same Executors (and executor caches); see the sketch below
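A minimal sketch of the shared-context pattern (input paths are placeholders): both threads submit jobs through the same SparkContext, which accepts job submissions from multiple threads, so their jobs run on the same executors and see the same executor caches.

import threading
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

def run_job(path):
    # each count() submits a job through the shared context
    print(path, sc.textFile(path).count())

t1 = threading.Thread(target=run_job, args=("hdfs:///tmp/a.txt",))
t2 = threading.Thread(target=run_job, args=("hdfs:///tmp/b.txt",))
t1.start(); t2.start()
t1.join(); t2.join()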

Page 47

Research: Robus

● Goal: Manage a shared cache to improve performance of a multi-tenant workload, while maintaining fairness.

● Approach: (1) Batch incoming jobs, (2) make caching decision for batch, (3) load cache, (4) run batch, (5) repeat.

● Enables sharing within batch, not across batches.

Kunjir et al. ROBUS: Fair Cache Allocation for Data-parallel Workloads. Proc. SIGMOD 2017.

Page 48

Robus

[Diagram: a ROBUS driver dispatching to three Executors; incoming jobs wait in a job queue, each carrying a weight (relative importance)]

Page 49

Research: What to Cache?

● Spark applications identify RDDs to be cached

● Executors attempt to cache what the application identifies.

● LRU eviction

● Cache granularity = RDD partition
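As a small sketch of the application-facing side of this, a PySpark program can mark an RDD and inspect the storage level Spark will use, though the eviction policy over cached partitions remains Spark's (LRU):

from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

data = sc.parallelize(range(1000)).cache()
print(data.is_cached)          # True: the RDD is marked for caching
print(data.getStorageLevel())  # the storage level its partitions will be cached at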

Page 50

Application-Managed Caching: Best Practice

What to cache?
● RDDs that will be re-used

How to cache it?
● MEMORY_ONLY if everything fits
● MEMORY_ONLY_SER if it helps things fit
● Avoid disk unless computation is expensive or selective

Difficult Advice to Follow!

Can we automate this?

Page 51

Spark Caching Example

[DAG: thirteen RDDs, A through M, spanning Jobs 1 through 4; C, E, and J each feed more than one job]

The application says to cache C, E, and J.

Is this the best choice? Can Spark make it automatically?

Page 52

Caching Issue #1: Partial Info

[DAG: the same four-job example]

The amount of re-use of C and E is unknown after Job 1. Unpersist them after Job 2?

Page 53

Caching Issue #2: Job Structure

[DAG: the same four-job example]

The cost of recalculating E depends on whether its ancestors are cached.

Page 54

Caching Issue #3: Unhinted Candidates

[DAG: the same four-job example]

There are alternatives to C, E, and J. Is D smaller than E? Is D -> E expensive?

Page 55

A Larger Example

[DAG: a program graph with 9 jobs, 16 stages, and 55 RDDs]

Page 56

Research: Neutrino

● Goal: eliminate programmer-specified storage levels.
● Approach:
  ○ pilot run(s) to learn the complete program graph
  ○ schedule cache/convert/discard decisions on RDD partitions at each stage to achieve a minimum-cost execution

Xu et al. Neutrino: Revisiting Memory Caching for Iterative Data Analytics. Proc. USENIX HotStorage 2016.

Page 57

Our Take on the Caching Problem

Goal: application-managed caching ⇒ application-assisted caching

Approach:
● The application identifies re-use (avoiding pilot runs and program analysis).
● Spark maximizes the effectiveness of the available space.
● Key challenge: size and cost estimation!
● Approach: on-the-fly estimation and conditional plans.

Page 58

Summary

● Spark applications expose parallelism using RDD operations.
● Spark provides distributed, fault-tolerant, in-memory execution.
● Memory is a critical resource for Spark:
  ○ for program execution and for result caching
● Related research:
  ○ exploiting re-use of input data and query results across programs
  ○ what, where, and how to cache results

Page 59

Thanks!

[email protected]