Apache Spark Performance: Past, Future and Present
TRANSCRIPT
Spark Performance: Past, Future, and Present
Kay Ousterhout, joint work with Christopher Canel, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung-Gon Chun
About Me: Apache Spark PMC member; recent PhD graduate from UC Berkeley, with thesis work on the performance of large-scale data analytics; co-founder at Kelda (kelda.io).
How can I make this faster?

Should I use a different cloud instance type?

Should I trade more CPU for less I/O by using better compression?

How can I make this faster? ???
Major performance improvements are possible via tuning and configuration…
…if only you knew which knobs to turn.
This talk:
Past: Performance instrumentation in Spark
Future: A new architecture that provides performance clarity
Present: Improving Spark's performance instrumentation
Example Spark Job (Word Count): split the input file into words and emit a count of 1 for each word; then, for each word, combine the counts and save the output.

spark.textFile("hdfs://…") \
  .flatMap(lambda l: l.split(" ")) \
  .map(lambda w: (w, 1)) \
  .reduceByKey(lambda a, b: a + b) \
  .saveAsTextFile("hdfs://…")
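As a rough illustration of what each step of the pipeline does, here is a plain-Python sketch of the same word count over a local list of lines (a hypothetical local stand-in; the real job runs distributed through Spark's RDD API):

```python
from collections import defaultdict
from functools import reduce

# Hypothetical local input standing in for the HDFS file.
lines = ["the quick brown fox", "the lazy dog"]

# flatMap: split each line into words.
words = [w for l in lines for w in l.split(" ")]

# map: emit (word, 1) for each word.
pairs = [(w, 1) for w in words]

# reduceByKey: combine counts per word using the same (a, b) -> a + b function.
counts = defaultdict(int)
for w, n in pairs:
    counts[w] = reduce(lambda a, b: a + b, [counts[w], n])

print(dict(counts))
```

Each list comprehension here corresponds to one distributed transformation in the job above; `reduceByKey` additionally implies a shuffle, which is where the map/reduce stage boundary in the next slide comes from.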
Spark Word Count Job:

Map Stage (split the input file into words and emit a count of 1 for each):
  spark.textFile("hdfs://…")
  .flatMap(lambda l: l.split(" "))
  .map(lambda w: (w, 1))

Reduce Stage (for each word, combine the counts and save the output):
  .reduceByKey(lambda a, b: a + b)
  .saveAsTextFile("hdfs://…")

Each stage runs as parallel tasks on Worker 1 … Worker n.
What happens in a reduce task? The task pipelines compute, network, and disk over time (one unit = the time to handle one shuffle block):
(1) Request a few shuffle blocks
(2) Process local data
(3) Write output to disk
(4) Process data fetched remotely
(5) Continue fetching remote data
At different points in this timeline, the task is bottlenecked on the network and disk together, on the network alone, or on the CPU.
What instrumentation exists today? Instrumentation is centered on the single, main task thread: shuffle read blocked time and executor computing time.

[Figure: the actual per-resource breakdown of a task vs. the timeline version shown by today's instrumentation.]
What instrumentation exists today? Instrumentation is centered on the single, main task thread: shuffle read and shuffle write blocked times are captured, but input read and output write blocked times are not instrumented. (Possible to add!)
Instrumenting read and write time: the simple picture ("compute, then write the shuffle block to disk") is a lie. In reality, Spark processes and then writes one record at a time: most writes get buffered, and only occasionally is the buffer flushed to disk.
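This buffering behavior is easy to see outside of Spark. The sketch below (a hypothetical illustration, not Spark code) pushes many small records through Python's buffered writer and counts how often the underlying stream is actually written to:

```python
import io

# A raw stream that counts how many times it is actually written to,
# standing in for the disk underneath the buffer.
class CountingRaw(io.RawIOBase):
    def __init__(self):
        self.underlying_writes = 0

    def writable(self):
        return True

    def write(self, b):
        self.underlying_writes += 1  # one call = one buffer flush reaching "disk"
        return len(b)

raw = CountingRaw()
buffered = io.BufferedWriter(raw, buffer_size=4096)

records = 10_000
for _ in range(records):
    buffered.write(b"word,1\n")  # 7 bytes per "record"
buffered.flush()

print(records, raw.underlying_writes)
```

With 4 KB of buffering, the ~70 KB of record writes reach the underlying stream in only a couple dozen calls, which is exactly why timing each record-level write would mostly measure buffer copies rather than disk I/O.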
Because Spark processes and then writes one record at a time (most writes buffered, with occasional flushes), this reality poses two challenges: record-level instrumentation has too high an overhead, and Spark doesn't know when buffers get flushed (HDFS does!).
Opportunities to improve instrumentation: tasks use fine-grained pipelining to parallelize resources, and the instrumented times are blocked times only (the task is doing other things in the background).
This talk:
Past: Performance instrumentation in Spark
Future: A new architecture that provides performance clarity
Present: Improving Spark's performance instrumentation
[Figure: Tasks 1–8 running over time as 4 concurrent tasks on a worker.]
Concurrent tasks may contend for the same resource (e.g., the network).
What's the bottleneck? At any time t, different tasks may be bottlenecked on different resources, and a single task may be bottlenecked on different resources at different times.
How much faster would my job be with 2x disk throughput? How would runtimes for these disk writes change, and how would that change the timing of (and contention for) other resources?
Today, tasks use pipelining to parallelize multiple resources. Proposal: build systems using monotasks that each consume just one resource.
Monotasks: each task uses exactly one resource (a network monotask, a disk monotask, or a compute monotask). One of today's tasks becomes a chain of monotasks: network read, then CPU, then disk write. A monotask doesn't start until all of its dependencies complete.
For the monotasks making up one of today's tasks, dedicated per-resource schedulers control contention: a network scheduler, a CPU scheduler (1 monotask per core), and a disk drive scheduler (1 monotask per disk).
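To make the decomposition concrete, here is a toy sketch (hypothetical code, not the actual Monotasks implementation) of one of today's tasks broken into single-resource monotasks that run only after their dependencies complete, with each monotask handed to its resource's scheduler:

```python
from collections import defaultdict, deque

class Monotask:
    """A unit of work that consumes exactly one resource."""
    def __init__(self, name, resource, deps=()):
        self.name, self.resource, self.deps = name, resource, list(deps)

def run(monotasks):
    """Execute monotasks in dependency order; return, per resource,
    the ordered list of monotask names its scheduler ran."""
    done = set()
    ran = defaultdict(list)
    pending = deque(monotasks)
    while pending:
        t = pending.popleft()
        if all(d.name in done for d in t.deps):
            ran[t.resource].append(t.name)  # the resource's scheduler runs it
            done.add(t.name)
        else:
            pending.append(t)               # not ready: wait for dependencies
    return dict(ran)

# One of today's tasks = network read -> compute -> disk write.
net = Monotask("fetch shuffle blocks", "network")
cpu = Monotask("process records", "cpu", deps=[net])
dsk = Monotask("write output", "disk", deps=[cpu])

schedules = run([dsk, cpu, net])  # submission order doesn't matter
print(schedules)
```

Because each queue holds work for exactly one resource, per-resource time accounting falls out of the schedule itself, which is the "performance telemetry trivial" point on the next slide.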
Spark today: tasks have non-uniform resource use, and 4 multi-resource tasks run concurrently. Monotasks: single-resource tasks are scheduled by per-resource schedulers. The result is API-compatible with Spark, achieves performance parity with Spark, and makes performance telemetry trivial!
How much faster would the job run if...
4x more machines?
Input stored in memory (no disk read, no CPU time to deserialize)?
Flash drives instead of disks (faster shuffle read/write time)?
Across these scenarios, a 10x improvement was predicted with at most 23% error.
Monotasks: break jobs into single-resource tasks. Using single-resource monotasks provides clarity without sacrificing performance, at the cost of a massive change to Spark internals (>20K lines of code).
This talk:
Past: Performance instrumentation in Spark
Future: A new architecture that provides performance clarity
Present: Improving Spark's performance instrumentation
Spark today: task resource use changes at fine time granularity, with 4 multi-resource tasks running concurrently. With monotasks, single-resource tasks lead to complete, trivial performance metrics. Can we get monotask-like per-resource metrics for each task today?
Can we get per-task resource use (compute, network, and disk over time)? Could we simply measure machine resource utilization?
Concurrent tasks may contend for the same resource (e.g., the network), and that contention is controlled by lower layers (e.g., the operating system).
Can we get per-task resource use? Not directly: machine utilization includes other tasks, and per-task I/O can't be measured directly because it happens in the background, mixed with other tasks' I/O.
Can we get per-task resource use? Existing metrics record the total data read, but not how long it took. Use machine utilization metrics to get the bandwidth!
Existing per-task I/O counters (e.g., shuffle bytes read) +
Machine-level utilization (and bandwidth) metrics =
Complete metrics about time spent using each resource
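As a worked example (all counters and bandwidths below are made-up, hypothetical numbers), combining a task's byte counters with machine-level bandwidth measurements yields an estimate of how long the task spent on each resource:

```python
# Per-task I/O counters (hypothetical values for one task):
shuffle_bytes_read = 512 * 1024 * 1024    # 512 MB read during the shuffle
output_bytes_written = 128 * 1024 * 1024  # 128 MB written as output

# Machine-level utilization metrics (assumed measured average bandwidths):
disk_read_bandwidth = 256 * 1024 * 1024   # 256 MB/s
disk_write_bandwidth = 128 * 1024 * 1024  # 128 MB/s

# Estimated time this task spent using the disk:
disk_read_seconds = shuffle_bytes_read / disk_read_bandwidth    # 2.0 s
disk_write_seconds = output_bytes_written / disk_write_bandwidth  # 1.0 s
print(disk_read_seconds, disk_write_seconds)
```

The same bytes-over-bandwidth division applies per resource (network, disk, CPU), which is what turns the existing counters into the complete per-resource times described above.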
Why do we care about performance clarity? The goal is to provide performance clarity, because the only way to improve performance is to know what to speed up. A typical performance evaluation is done by a group of experts; in practice, performance tuning often falls to a single novice.
Goal: provide performance clarity; the only way to improve performance is to know what to speed up.
Some instrumentation exists already, focused on blocked times in the main task thread.
Many opportunities to improve instrumentation remain:
(1) Add read/write instrumentation at a lower level (e.g., HDFS)
(2) Add machine-level utilization info
(3) Calculate per-resource time
More details at kayousterhout.org