making sense of performance in data analytics...
TRANSCRIPT
![Page 1: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/1.jpg)
Making Sense of Performance in Data
Analytics FrameworksAuthors: Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy,
Scott Shenker, Byung-Gon Chun
Presenter: Zi Wang
![Page 2: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/2.jpg)
Why?
• Commonly Accepted mantras
• Network
• IO/disk
• Straggler
![Page 3: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/3.jpg)
Takeways
• Network can reduce job completion time by at 2%
• I/O optimizations lead to <19% reduction in completion time
• Many straggler causes can be identified and fixed
• CPU is in general the bottleneck
![Page 4: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/4.jpg)
Outline
• Methodology
• Results
• Threats to validity
![Page 5: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/5.jpg)
What is the job’s bottleneckNetworkComputeDisk
tasks
time
Time t: different tasks may be bottlenecked on different resources
Task x: may be bottlenecked on different resources at different times
![Page 6: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/6.jpg)
Blocked Time Analysis
• Time when task is blocked on one resource (e.g network)
• Blocked time analysis: how much faster would the job complete if tasks never blocked on the resource?
![Page 7: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/7.jpg)
An Example of Blocked Time Analysis for Network
(1) Measure time when tasks are blocked
on the network
tasks
(2) Simulate how job completion time would change
![Page 8: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/8.jpg)
![Page 9: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/9.jpg)
Scheduler would have moved Task 2 to slot 2
Blocked time analysis: how quickly could a job
have completed if a resource were infinitely fast?
![Page 10: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/10.jpg)
Experiments Setting
• Big Data Benchmark, 50 queries, 50GB Data, 5 machines
• TPC-DS (Scale 5000), 260 queries, 850GB Data, 20 machines
• Production, 30 queries, tens of GB Data, 9 machines
![Page 11: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/11.jpg)
Experiments Setting
• All three workloads are Spark-SQL workloads
• Coarse-grained analysis of traces from Facebook, Google, Microsoft are used for sanity check
![Page 12: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/12.jpg)
![Page 13: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/13.jpg)
Are jobs network-light?
![Page 14: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/14.jpg)
Analysis
• Queries often shuffle and output much less data than they read
• However, the result seems inconsistent from previous work…
![Page 15: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/15.jpg)
Two Reasons
• Incomplete Metric
• Only look at shuffle time
• Conflation of CPU and network time
• Sending data over the network has an associated CPU cost
![Page 16: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/16.jpg)
Analysis for I/O
• Compressed data is used, CPU is traded for I/O
• Spark is written in Scala. Data read must be deserialized to Java Objects.
![Page 17: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/17.jpg)
Role of Straggler
• The median reduction from eliminating straggler < 10%
• Common causes: garbage collection, I/O
• Many Stragglers are caused by inherent factors like output size
![Page 18: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/18.jpg)
Threats to Validity
• Only One Framework (Spark)
• Small cluster sizes
• Only three workloads
![Page 19: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/19.jpg)
Related work• Instead of using Spark, using Naiad can achieve
up to 3x speedups going from 1G network to 10G network
• Spark is also memory-efficient, leveraging “in-memory” computation
• Modern hardware (I/O, network links) are also more improved compared to CPU
![Page 20: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/20.jpg)
Comparison to Pivot Tracing
• Static v.s. Dynamic
• Resource Directed Analysis v.s. Crossing Boundaries Analysis
![Page 21: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/21.jpg)
References• Making Sense of Performance in Data Analytics
Frameworks
• Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems
• The impact of fast networks on graph analytics
• Project Tungsten: Bringing Apache Spark Closer to Bare Metal
![Page 22: Making Sense of Performance in Data Analytics Frameworkspages.cs.wisc.edu/~shivaram/cs744-slides/cs744-making-zi.pdf · Making Sense of Performance in Data Analytics Frameworks Authors:](https://reader033.vdocument.in/reader033/viewer/2022042808/5f886878312b932b271639e0/html5/thumbnails/22.jpg)
–Larry Ellison
“The only way to get ahead is to find errors in conventional wisdom.”