Apache Spark Performance: Past, Future and Present
TRANSCRIPT
Spark Performance: Past, Future, and Present
Kay Ousterhout, joint work with Christopher Canel, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung-Gon Chun
About Me: Apache Spark PMC member; recent PhD graduate from UC Berkeley, with thesis work on the performance of large-scale data analytics; co-founder at Kelda (kelda.io).
How can I make this faster?

Should I use a different cloud instance type?

Should I trade more CPU for less I/O by using better compression?

How can I make this faster? ???
Major performance improvements are possible via tuning and configuration…
…if only you knew which knobs to turn.
This talk:
Past: Performance instrumentation in Spark
Future: A new architecture that provides performance clarity
Present: Improving Spark's performance instrumentation
Example Spark Job (Word Count): split the input file into words and emit a count of 1 for each word; then, for each word, combine the counts and save the output.

spark.textFile("hdfs://…") \
  .flatMap(lambda l: l.split(" ")) \
  .map(lambda w: (w, 1)) \
  .reduceByKey(lambda a, b: a + b) \
  .saveAsTextFile("hdfs://…")
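As a rough illustration of what each step of the pipeline does, here is a plain-Python sketch of the same word count over a local list of lines (a hypothetical local stand-in; the real job runs distributed through Spark's RDD API):

```python
from collections import defaultdict
from functools import reduce

# Hypothetical local input standing in for the HDFS file.
lines = ["the quick brown fox", "the lazy dog"]

# flatMap: split each line into words.
words = [w for l in lines for w in l.split(" ")]

# map: emit (word, 1) for each word.
pairs = [(w, 1) for w in words]

# reduceByKey: combine counts per word using the same (a, b) -> a + b function.
counts = defaultdict(int)
for w, n in pairs:
    counts[w] = reduce(lambda a, b: a + b, [counts[w], n])

print(dict(counts))
```

Each list comprehension here corresponds to one distributed transformation in the job above; `reduceByKey` additionally implies a shuffle, which is where the map/reduce stage boundary in the next slide comes from.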
Spark Word Count Job:

Map Stage (split the input file into words and emit a count of 1 for each):
  spark.textFile("hdfs://…")
  .flatMap(lambda l: l.split(" "))
  .map(lambda w: (w, 1))

Reduce Stage (for each word, combine the counts and save the output):
  .reduceByKey(lambda a, b: a + b)
  .saveAsTextFile("hdfs://…")

Each stage runs as parallel tasks on Worker 1 … Worker n.
What happens in a reduce task? The task pipelines compute, network, and disk over time (one unit = the time to handle one shuffle block):
(1) Request a few shuffle blocks
(2) Process local data
(3) Write output to disk
(4) Process data fetched remotely
(5) Continue fetching remote data
At different points in this timeline, the task is bottlenecked on the network and disk together, on the network alone, or on the CPU.
What instrumentation exists today? Instrumentation is centered on the single, main task thread: shuffle read blocked time and executor computing time.

[Figure: the actual per-resource breakdown of a task vs. the timeline version shown by today's instrumentation.]
What instrumentation exists today? Instrumentation is centered on the single, main task thread: shuffle read and shuffle write blocked times are captured, but input read and output write blocked times are not instrumented. (Possible to add!)
Instrumenting read and write time: the simple picture ("compute, then write the shuffle block to disk") is a lie. In reality, Spark processes and then writes one record at a time: most writes get buffered, and only occasionally is the buffer flushed to disk.
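This buffering behavior is easy to see outside of Spark. The sketch below (a hypothetical illustration, not Spark code) pushes many small records through Python's buffered writer and counts how often the underlying stream is actually written to:

```python
import io

# A raw stream that counts how many times it is actually written to,
# standing in for the disk underneath the buffer.
class CountingRaw(io.RawIOBase):
    def __init__(self):
        self.underlying_writes = 0

    def writable(self):
        return True

    def write(self, b):
        self.underlying_writes += 1  # one call = one buffer flush reaching "disk"
        return len(b)

raw = CountingRaw()
buffered = io.BufferedWriter(raw, buffer_size=4096)

records = 10_000
for _ in range(records):
    buffered.write(b"word,1\n")  # 7 bytes per "record"
buffered.flush()

print(records, raw.underlying_writes)
```

With 4 KB of buffering, the ~70 KB of record writes reach the underlying stream in only a couple dozen calls, which is exactly why timing each record-level write would mostly measure buffer copies rather than disk I/O.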
Because Spark processes and then writes one record at a time (most writes buffered, with occasional flushes), this reality poses two challenges: record-level instrumentation has too high an overhead, and Spark doesn't know when buffers get flushed (HDFS does!).
Opportunities to improve instrumentation: tasks use fine-grained pipelining to parallelize resources, and the instrumented times are blocked times only (the task is doing other things in the background).
This talk:
Past: Performance instrumentation in Spark
Future: A new architecture that provides performance clarity
Present: Improving Spark's performance instrumentation
[Figure: Tasks 1–8 running over time as 4 concurrent tasks on a worker.]
Concurrent tasks may contend for the same resource (e.g., the network).
What's the bottleneck? At any time t, different tasks may be bottlenecked on different resources, and a single task may be bottlenecked on different resources at different times.
How much faster would my job be with 2x disk throughput? How would runtimes for these disk writes change, and how would that change the timing of (and contention for) other resources?
Today, tasks use pipelining to parallelize multiple resources. Proposal: build systems using monotasks that each consume just one resource.
Monotasks: each task uses exactly one resource (a network monotask, a disk monotask, or a compute monotask). One of today's tasks becomes a chain of monotasks: network read, then CPU, then disk write. A monotask doesn't start until all of its dependencies complete.
For the monotasks making up one of today's tasks, dedicated per-resource schedulers control contention: a network scheduler, a CPU scheduler (1 monotask per core), and a disk drive scheduler (1 monotask per disk).
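To make the decomposition concrete, here is a toy sketch (hypothetical code, not the actual Monotasks implementation) of one of today's tasks broken into single-resource monotasks that run only after their dependencies complete, with each monotask handed to its resource's scheduler:

```python
from collections import defaultdict, deque

class Monotask:
    """A unit of work that consumes exactly one resource."""
    def __init__(self, name, resource, deps=()):
        self.name, self.resource, self.deps = name, resource, list(deps)

def run(monotasks):
    """Execute monotasks in dependency order; return, per resource,
    the ordered list of monotask names its scheduler ran."""
    done = set()
    ran = defaultdict(list)
    pending = deque(monotasks)
    while pending:
        t = pending.popleft()
        if all(d.name in done for d in t.deps):
            ran[t.resource].append(t.name)  # the resource's scheduler runs it
            done.add(t.name)
        else:
            pending.append(t)               # not ready: wait for dependencies
    return dict(ran)

# One of today's tasks = network read -> compute -> disk write.
net = Monotask("fetch shuffle blocks", "network")
cpu = Monotask("process records", "cpu", deps=[net])
dsk = Monotask("write output", "disk", deps=[cpu])

schedules = run([dsk, cpu, net])  # submission order doesn't matter
print(schedules)
```

Because each queue holds work for exactly one resource, per-resource time accounting falls out of the schedule itself, which is the "performance telemetry trivial" point on the next slide.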
Spark today: tasks have non-uniform resource use, and 4 multi-resource tasks run concurrently. Monotasks: single-resource tasks are scheduled by per-resource schedulers. The result is API-compatible with Spark, achieves performance parity with Spark, and makes performance telemetry trivial!
How much faster would the job run if...
4x more machines?
Input stored in memory (no disk read, no CPU time to deserialize)?
Flash drives instead of disks (faster shuffle read/write time)?
Across these scenarios, a 10x improvement was predicted with at most 23% error.
Monotasks: break jobs into single-resource tasks. Using single-resource monotasks provides clarity without sacrificing performance, at the cost of a massive change to Spark internals (>20K lines of code).
This talk:
Past: Performance instrumentation in Spark
Future: A new architecture that provides performance clarity
Present: Improving Spark's performance instrumentation
Spark today: task resource use changes at fine time granularity, with 4 multi-resource tasks running concurrently. With monotasks, single-resource tasks lead to complete, trivial performance metrics. Can we get monotask-like per-resource metrics for each task today?
Can we get per-task resource use (compute, network, and disk over time)? Could we simply measure machine resource utilization?
Concurrent tasks may contend for the same resource (e.g., the network), and that contention is controlled by lower layers (e.g., the operating system).
Can we get per-task resource use? Not directly: machine utilization includes other tasks, and per-task I/O can't be measured directly because it happens in the background, mixed with other tasks' I/O.
Can we get per-task resource use? Existing metrics record the total data read, but not how long it took. Use machine utilization metrics to get the bandwidth!
Existing per-task I/O counters (e.g., shuffle bytes read) +
Machine-level utilization (and bandwidth) metrics =
Complete metrics about time spent using each resource
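As a worked example (all counters and bandwidths below are made-up, hypothetical numbers), combining a task's byte counters with machine-level bandwidth measurements yields an estimate of how long the task spent on each resource:

```python
# Per-task I/O counters (hypothetical values for one task):
shuffle_bytes_read = 512 * 1024 * 1024    # 512 MB read during the shuffle
output_bytes_written = 128 * 1024 * 1024  # 128 MB written as output

# Machine-level utilization metrics (assumed measured average bandwidths):
disk_read_bandwidth = 256 * 1024 * 1024   # 256 MB/s
disk_write_bandwidth = 128 * 1024 * 1024  # 128 MB/s

# Estimated time this task spent using the disk:
disk_read_seconds = shuffle_bytes_read / disk_read_bandwidth    # 2.0 s
disk_write_seconds = output_bytes_written / disk_write_bandwidth  # 1.0 s
print(disk_read_seconds, disk_write_seconds)
```

The same bytes-over-bandwidth division applies per resource (network, disk, CPU), which is what turns the existing counters into the complete per-resource times described above.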
Why do we care about performance clarity? The goal is to provide performance clarity, because the only way to improve performance is to know what to speed up. A typical performance evaluation is done by a group of experts; in practice, performance tuning often falls to a single novice.
Goal: provide performance clarity; the only way to improve performance is to know what to speed up.
Some instrumentation exists already, focused on blocked times in the main task thread.
Many opportunities to improve instrumentation remain:
(1) Add read/write instrumentation at a lower level (e.g., HDFS)
(2) Add machine-level utilization info
(3) Calculate per-resource time
More details at kayousterhout.org