to share or not to share? ryan johnson nikos hardavellas, ippokratis pandis, naju mancheril, stavros...
TRANSCRIPT
To Share or Not to Share?
Ryan Johnson
Nikos Hardavellas, Ippokratis Pandis, Naju Mancheril, Stavros Harizopoulos**, Kivanc Sabirli,
Anastasia Ailamaki, Babak Falsafi
PARALLEL DATA LABORATORYCarnegie Mellon University
**HP LABS
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 2
Motivation For Work Sharing
scan
join
aggregate
scan
output
StudentDept
Query: What is the highest undergraduate GPA?
aggregate
output
scan
Student
Query: What is the average GPA in the ECE dept.?
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 3
Motivation For Work Sharing• Many queries in
system• Similar requests• Redundant work
• Work Sharing• Detect
redundant work• Compute results
once and share
• Big win for I/O, uniprocessors
scan
join
aggregate
scan
output
StudentDept
aggregate
output
scan
Student
2x speedup for TPC-H queries [hariz05]
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 4
0.0
0.5
1.0
1.5
2.0
0 15 30 45Shared Queries
Sp
eed
up
du
e to
WS
1 CPU
7x
Work Sharing on Modern Hardware
8 CPU
Memory
L1
L2
core
L1
L2
core
L2
Memory
Work sharing can hurt performance!
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 5
Contributions
• Observation• Work sharing can hurt performance on parallel
hardware
• Analysis• Develop intuitive analytical model of work sharing• Identify trade-off between total work, critical path
• Application• Model-based policy outperforms static ones by
up to 6x
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 6
Outline
• Introduction• Part I: Intuition and Model• Part II: Analysis and Experiments
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 7
Challenges of Exploiting Work Sharing
• Independent execution only?• Load reduction from work sharing can be useful
• Work sharing only?• Indiscriminate application can hurt performance
• To share or not to share? • System and workload dependent • Adapt decisions at runtime
Must understand work sharing to exploit it fully
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 8
Work Sharing vs. Parallelism
Query 2
Independent Execution
Query 1
Query 2 response time
Query 1 response time
Scan
Join
Aggregate
P = 4.33
Critical Paths
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 9
Work Sharing vs. Parallelism
Query 2
Query 1
Query 2 response time
Query 1 response time
Shared Execution
Penalty
Scan
Join
Aggregate
P = 2.75
P = 4.33
Critical path now longer
Total work and critical path both important
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 10
Understanding Work Sharing
• Performance depends on two factors:
• Work sharing presents a trade-off• Reduces total work• Potentially lengthens critical path
Balance both factors or performance suffers
thCriticalPaTotalWorkfThroughput
1,
1
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 11
Basis for a Model
• “Closed” system• Consistent high load• Throughput computing• Assumed in most benchmarks• Fixed number of clients
• Little’s Law governs throughput• Higher response time = lower throughput• Total work not a direct factor!
Load reduction secondary to response time
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 12
Predicting Response Time
• Case 1: Compute-bound
• Case 2: Critical path-bound
• Larger bottleneck determines response time
stage pipeslowest at Delay max pTime
Processors Available
on UtilizatiRequested
n
uTime
Model provides u and pmax
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 13
An Analytical Model of Work Sharing
Sharing helpful when Xshared > Xalone
)(
1,
)(min),(
max mpmu
nmnmX
Throughput for m queries and n processors
Potentially worsened by work sharing
][][
TE
mXE
Improved by work sharing
U = requested utilizationPmax = longest pipe stage
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 14
Outline
• Introduction• Part I: Intuition and Model• Part II: Analysis and Experiments
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 15
Experimental Setup• Hardware
• Sun T2000 “Niagara” with 16 GB RAM• 8 cores (32 threads) • Solaris processor sets vary effective CPU count
• Cordoba• Staged DBMS• Naturally exposes work sharing• Flexible work sharing policies
• 1GB TPCH dataset• Fixed Client and CPU counts per run
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 16
Predicted vs. Measured Performance
0
0.2
0.4
0.6
0.8
1
1.2
1.4
0 15 30 45Shared Queries
Sp
eed
up
du
e to
WS 1 CPU
1 CPU model
2 CPU2 CPU model
8 CPU8 CPU model
32 CPU32 CPU model
Model Validation: TPCH Q1
Avg/max error: 5.7% / 22%
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 17
Predicted vs. Measured Performance
0
10
20
30
0 15 30 45Clients
Sp
eed
up
du
e to
WS 1 CPU
1 CPU model
2 CPU2 CPU model
8 CPU8 CPU model
32 CPU32 CPU model
Model Validation: TPCH Q4
Behavior varies with both system and workload
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 18
Exploring WS vs. Parallelism• Work sharing splits query
into three parts• Independent work
– Per-query, parallel– Total work
• Serial work– Per-query, serial– Critical path
• Shared work– Computed once– “Free” after first query
Serial - 4% Independent - 37%
Shared - 59%
Example:
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 19
0
0.5
1
1.5
2
2.5
0 8 16 24 32Shared Queries
Ben
efit
fro
m W
ork
Sh
arin
g
4
8
16
32
Exploring WS vs. Parallelism
Behavior matches previously published results
Potential Speedup
CPUs
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 20
0
0.5
1
1.5
2
2.5
0 8 16 24 32Shared Queries
Ben
efit
fro
m W
ork
Sh
arin
g
4
8
Exploring WS vs. Parallelism
Potential Speedup
CPUs
Saturated
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 21
0
0.5
1
1.5
2
2.5
0 8 16 24 32Shared Queries
Ben
efit
fro
m W
ork
Sh
arin
g
4
8
16
32
Exploring WS vs. Parallelism
Potential Speedup
CPUs
Saturated
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 22
0
0.5
1
1.5
2
2.5
0 8 16 24 32Shared Queries
Ben
efit
fro
m W
ork
Sh
arin
g
4
8
16
32
Exploring WS vs. Parallelism
More processors shift bottleneck to critical path
Saturated
Potential Speedup
CPUs
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 23
Performance Impact of Serial Work
Critical path quickly becomes major bottleneck
Impact of Serial Work (32 CPU)
0
0.5
1
1.5
2
2.5
0 10 20 30 40Shared Queries
Ben
efit
fro
m W
ork
Sh
arin
g
0%
1%
2%7%
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 24
Model-guided Work Sharing
• Integrate predictive model into Cordoba• Predict benefit of work sharing for each new query
• Consider multiple groups of queries at once• Shorter critical path, increased parallelism
• Experimental setup• Profile run with 2 clients, 2 CPUs• Extract model parameters with profiling tools• 20 clients submit mix of TPCH Q1 and Q4
Compare against always-, never-share policies
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 25
Comparison of Work Sharing Strategies
Model-based policy balances critical path and load
2 CPU
0
50
100
150
200
All Q1 50/50 All Q4
Query Ratio
Qu
eri
es/m
in
always sharemodel guidednever share
32 CPU
0
50
100
150
200
250
Qu
eri
es/m
inAll Q1 50/50 All Q4
Query Ratio
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 26
Related Work
• Many existing work sharing schemes• Identification occurs at different stages in the
query’s lifetime• All allow pipelined query execution
Materialized Views
[rouss82]
Multiple Query Optimization
[roy00]
Staged DBMS[hariz05]
Synchronized Scanning[lang07]
Schema design
Buffer Pool Access
Query compilation
Query execution
Early Late
Model describes all types of work sharing
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 27
Conclusions
• Work sharing can hurt performance• Highly parallel, memory resident machines
• Intuitive analytical model captures behavior• Trade-off between load reduction and critical path
• Model-guided work sharing highly effective• Outperforms static policies by up to 6x
http://www.cs.cmu.edu/~StagedDB/
Ryan Johnson © Apr 18, 2023 http://www.pdl.cmu.edu/ 28
References
•[hariz05] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. “QPipe: A Simultaneously Pipelined Relational Query Engine.” In Proc. SIGMOD, 2005.
•[lang07] C. Lang, B. Bhattacharjee, T. Malkemus, S. Padmanabhan, and K. Wong. “Increasing Buffer-Locality for Multiple Relational Table Scans through Grouping and Throttling.” In Proc. ICDE, 2007.•[rouss82] N. Roussopoulos. “View Indexing in Relational databases.” In ACM TODS, 7(2):258-290, 1982.•[roy00] P. Roy, S. Seshadri, S. Sudarshan, and S. Bhobe. “Efficient and Extensible Algorithms for Multi Query Optimization.” In Proc. SIGMOD, 2000.
http://www.cs.cmu.edu/~StagedDB/