Optimum System Balance for Systems of Finite Price
John D. McCalpin, Ph.D.
IBM Corporation
Austin, TX
SuperComputing 2004
Pittsburgh, PA
November 10, 2004
Overview
• The HPC Challenge Benchmark was announced last year at SuperComputing’2003
• The HPC Challenge Benchmark consists of
– LINPACK (HPL)
– STREAM
– PTRANS (transposing the array used by HPL)
– RandomAccess (random read/modify/write)
– FFT
– and some low-level MPI latency & BW measurements
• No single figure of merit is defined
Overview (continued)
• Q: What is a “balanced” system?
• My answer: “A balanced system is one for which the primary applications are limited in performance by the most expensive component(s) of the system.”
The Two Questions
• We need to understand both performance and cost in the context of low-level component metrics
• Performance
– What performance model?
– Use a harmonically weighted, time-based model
• Cost
– What cost model?
– Simple linear additive cost model
Performance Model
• Composite Figures of Merit for Performance must be based on “time” rather than “rate”
– i.e., weighted harmonic means of rates
• Why?
– Combining “rates” in any other way fails to have a “Law of Diminishing Returns”
• Time = Work/Rate
• Repeat for each component: Ti = Wi/Ri
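A minimal sketch of why a time-based composite shows diminishing returns, assuming a hypothetical two-component workload. The work amounts and rates are illustrative values, not figures from the talk, and the function names are introduced here only for the example.

```python
def composite_rate(work, rates):
    """Weighted harmonic mean of rates: total work / total time, with T_i = W_i / R_i."""
    total_time = sum(w / r for w, r in zip(work, rates))
    return sum(work) / total_time


def arithmetic_rate(work, rates):
    """Work-weighted arithmetic mean of rates, shown only for contrast."""
    return sum(w * r for w, r in zip(work, rates)) / sum(work)


# Hypothetical workload: equal work in two components (illustrative values).
work = [1.0, 1.0]

for speedup in (1, 10, 100, 1000):
    rates = [1.0 * speedup, 1.0]  # speed up only the first component
    print(speedup,
          round(composite_rate(work, rates), 3),
          round(arithmetic_rate(work, rates), 3))
```

In this sketch the time-based composite can never exceed 2.0, because the second component's time is unchanged, while the arithmetic mean of rates keeps growing with the speedup.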
A Simple Composite Model
• Assume the time to solution is composed of a compute time inversely proportional to peak GFLOPS plus a memory transfer time inversely proportional to sustained memory bandwidth
• Assume “x Bytes/FLOP” to get:
$$\text{``Balanced'' GFLOPS} \equiv \frac{1\ \text{``Effective'' FP op}}{\dfrac{1\ \text{FP op}}{\text{Peak GFLOPS}} + \dfrac{x\ \text{Bytes}}{\text{Sustained GB/s}}}$$
• Target SPECfp_rate2000 as the workload
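The composite model above can be sketched as a small function. The machine parameters in the example call are hypothetical, and the function name is introduced here for illustration only.

```python
def balanced_gflops(peak_gflops, sustained_gb_per_s, bytes_per_flop):
    """Effective GFLOPS when each FP op also moves `bytes_per_flop` bytes.

    Time per effective FP op = compute time + memory transfer time.
    """
    compute_time = 1.0 / peak_gflops                   # ns per FP op
    memory_time = bytes_per_flop / sustained_gb_per_s  # ns per FP op
    return 1.0 / (compute_time + memory_time)


# Hypothetical machine: 4 GFLOPS peak, 2 GB/s sustained, 0.5 Bytes/FLOP.
print(balanced_gflops(4.0, 2.0, 0.5))  # 1 / (0.25 + 0.25) = 2.0
```

With these assumed values, compute and memory each contribute half of the time, so the effective rate is half of peak.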
Does Peak GFLOPS predict SPECfp_rate2000?
[Figure: SPECfp_rate2000 vs Peak MFLOPS. x-axis: SPECfp_rate2000/cpu (0 to 30); y-axis: Peak MFLOPS (0 to 8000).]
Does Sustained Memory Bandwidth predict SPECfp_rate2000?
[Figure: SPECfp_rate2000 vs Sustained BW. x-axis: SPECfp_rate2000/cpu (0 to 30); y-axis: GB/s per CPU (0 to 7).]
Optimized Model Results
• Results rounded to nearby round values:
– Bytes/FLOP for large caches = 0.16
– Bytes/FLOP for small caches = 0.80
– Size of asymptotically large cache = ~12 MB
– Coefficient of best fit = ~6.4
– The units of the coefficient are SPECfp_rate2000 / Effective GFLOPS
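As a rough illustration of how the fitted values might be applied, the sketch below estimates SPECfp_rate2000/cpu at the two Bytes/FLOP endpoints. The machine (4 GFLOPS peak, 2 GB/s sustained) is hypothetical, and the slides do not give how Bytes/FLOP varies between small caches and the ~12 MB asymptote, so that interpolation is not modeled here.

```python
COEFFICIENT = 6.4            # SPECfp_rate2000 per effective GFLOPS (fitted value from the slide)
BYTES_PER_FLOP_LARGE = 0.16  # large caches (fitted value from the slide)
BYTES_PER_FLOP_SMALL = 0.80  # small caches (fitted value from the slide)


def effective_gflops(peak_gflops, sustained_gb_per_s, bytes_per_flop):
    """Harmonic combination of compute time and memory transfer time per FP op."""
    return 1.0 / (1.0 / peak_gflops + bytes_per_flop / sustained_gb_per_s)


def estimated_specfp_rate(peak_gflops, sustained_gb_per_s, bytes_per_flop):
    """Estimated SPECfp_rate2000/cpu = coefficient * effective GFLOPS."""
    return COEFFICIENT * effective_gflops(peak_gflops, sustained_gb_per_s, bytes_per_flop)


# Hypothetical machine, evaluated at both cache-size endpoints.
for label, x in (("small cache", BYTES_PER_FLOP_SMALL), ("large cache", BYTES_PER_FLOP_LARGE)):
    print(label, round(estimated_specfp_rate(4.0, 2.0, x), 2))
```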
Does this Revised Metric predict SPECfp_rate2000?
[Figure: Optimized SPECfp_rate2000 Estimates. x-axis: SPECfp_rate2000/cpu (0 to 30); y-axis: Estimated Rate/cpu (0 to 30).]
Cost Model
• Assume simple linear additive model
– FLOPS cost some amount
– Sustained BW costs a different amount
– Define:
β = Rmem / Rcpu (system characteristics)
γ = Wmem / Wcpu (application characteristics)
δ = ($/BW) / ($/FLOPS) (technology characteristics)
How to Optimize?
• For a given application (γ = Wmem/Wcpu), what is the optimum system balance β = Rmem/Rcpu?
• Many people seem to believe that the system should be “balanced” to the application:
– β_optimal = γ, i.e.,
– Rmem/Rcpu = Wmem/Wcpu
• This does not optimize price/performance
The Correct Optimization
• This is actually an easy optimization problem
• Minimize cost/performance
– Same as minimizing cost * time
• Optimum cost/performance occurs at
– β = sqrt(γ/δ), where β = Rmem/Rcpu, γ = Wmem/Wcpu, and δ = ($/BW) / ($/FLOPS)
• Definitely not intuitive!
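One way to reach this result, as a sketch assuming the linear additive cost model and the additive time model from the earlier slides, with β = Rmem/Rcpu, γ = Wmem/Wcpu, and δ = ($/BW) / ($/FLOPS). The unit-cost symbols c_cpu and c_mem are notation introduced here, not taken from the talk.

```latex
\begin{align*}
\text{Cost} &= c_{cpu} R_{cpu} + c_{mem} R_{mem}
             = c_{cpu} R_{cpu}\,(1 + \delta\beta),
             \qquad \delta = c_{mem}/c_{cpu} \\
\text{Time} &= \frac{W_{cpu}}{R_{cpu}} + \frac{W_{mem}}{R_{mem}}
             = \frac{W_{cpu}}{R_{cpu}}\left(1 + \frac{\gamma}{\beta}\right) \\
\text{Cost} \times \text{Time} &= c_{cpu} W_{cpu}\,(1 + \delta\beta)\left(1 + \frac{\gamma}{\beta}\right) \\
\frac{d}{d\beta}\left[(1 + \delta\beta)\left(1 + \frac{\gamma}{\beta}\right)\right]
  &= \delta - \frac{\gamma}{\beta^{2}} = 0
  \quad\Longrightarrow\quad \beta = \sqrt{\gamma/\delta}
\end{align*}
```

Under these assumptions R_cpu cancels out of the product, so the optimum depends only on the application ratio γ and the technology cost ratio δ.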
Example: High BW, expensive BW
[Figure: gamma = 3, delta = 3. x-axis: beta (0.10 to 10.00); y-axis: relative price/performance (0 to 3).]
γ = 3: relatively high BW
δ = 3: relatively expensive BW
Optimum price/performance is at β = 1, not β = 3
Improvement in price/performance is ~30%
High BW, very expensive memory
[Figure: gamma = 3, delta = 10. x-axis: beta (0.10 to 10.00); y-axis: relative price/performance (0 to 3.5).]
γ = 3: relatively high BW
δ = 10: very expensive BW
Optimum price/performance is at β = sqrt(3/10) ≈ 0.55, not β = 3
Improvement in price/performance is ~50%
Low-BW, expensive BW
[Figure: gamma = 0.1, delta = 3. x-axis: beta (0.10 to 10.00); y-axis: relative price/performance (0 to 14).]
γ = 0.1: low BW application
δ = 3: moderately expensive BW
Optimum price/performance is at β = 0.18, not β = 0.1
Improvement in price/performance is ~5%
More BW helps here even though it is expensive, because the application does not need much.
Medium BW, expensive BW
[Figure: gamma = 1, delta = 3. x-axis: beta (0.10 to 10.00); y-axis: relative price/performance (0 to 4.5).]
γ = 1: modest BW
δ = 3: moderately expensive BW
Optimum price/performance is at β = 0.58, not β = 1
Improvement in price/performance is ~10%
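The optimum balance for the four example cases can be checked with a short sketch, assuming the rule β = sqrt(γ/δ) with β = Rmem/Rcpu, γ = Wmem/Wcpu, and δ = ($/BW) / ($/FLOPS). The cost × time expression is an assumption derived from a linear cost model and an additive time model, valid only up to a constant factor. The slides do not state how the plotted curves are normalized, so the sketch does not attempt to reproduce the quoted improvement percentages.

```python
import math


def relative_cost_time(beta, gamma, delta):
    """Cost x time up to a constant factor (assumed linear cost, additive time)."""
    return (1.0 + delta * beta) * (1.0 + gamma / beta)


def optimum_beta(gamma, delta):
    """System balance that minimizes cost x time under the assumed model."""
    return math.sqrt(gamma / delta)


# (gamma, delta) pairs taken from the four example slides.
cases = [(3.0, 3.0), (3.0, 10.0), (0.1, 3.0), (1.0, 3.0)]

for gamma, delta in cases:
    best = optimum_beta(gamma, delta)
    # The optimum is never worse than matching the system to the application (beta = gamma).
    assert relative_cost_time(best, gamma, delta) <= relative_cost_time(gamma, gamma, delta)
    print(f"gamma={gamma}, delta={delta}: optimum beta = {best:.2f}")
```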
Summary
• Balance is important to cost/performance
• You must understand performance
• You must understand cost
• Optimum cost-performance is not intuitive!