  • ^ Performance

    The STREAM Benchmark

    John D. McCalpin, Ph.D.
    IBM eServer Performance

    2005-01-27

  • ^ Performance

    History

    Scientific computing was largely based on the vector paradigm from the late 1970s through the 1980s

    E.g., the classic DAXPY loop:

        REAL*8 X(N),Y(N)
        DO I=1,N
           X(I) = X(I) + Q*Y(I)
        END DO

    If all data comes from memory, this requires 12 Bytes per FLOP (= 3*8 Bytes / 2 FLOPs)
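The 12 Bytes per FLOP figure can be checked by counting the memory operations in one loop iteration. The sketch below assumes that Q stays in a register and that every array element is an 8-byte value, as REAL*8 implies; the variable names are illustrative only.

```python
# Memory traffic vs. arithmetic for one iteration of X(I) = X(I) + Q*Y(I).
BYTES_PER_WORD = 8   # REAL*8 is a 64-bit floating-point value

loads = 2            # read X(I) and Y(I); Q is assumed to stay in a register
stores = 1           # write X(I)
flops = 2            # one multiply and one add

bytes_moved = (loads + stores) * BYTES_PER_WORD
bytes_per_flop = bytes_moved / flops

print(bytes_moved, bytes_per_flop)   # 24 12.0
```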

  • ^ Performance

    History (cont’d)

    Vector machines like the Cray X/MP, Cray Y/MP, Cray C90, CDC Cyber 205, ETA-10, and various Japanese vector systems were built to this target of 12 Bytes/s of peak memory bandwidth per peak FLOP/s

    Easy to sustain 30%-40% of peak performance for vectorizable code
        >90% on LINPACK

  • ^ Performance

    History (cont’d)

    1990: The Attack of the Killer Micros
        HP PA-RISC (aka “snakes”)
        IBM RS/6000 (aka “RIOS”, aka “POWER”, aka “POWER1”)

    These systems provided performance roughly matching Cray vector machines on many applications, but fell very short on others

    Users were confused…

  • ^ Performance

    The Genesis of STREAM

    I was an assistant professor at the University of Delaware in 1990

    My research was on large-scale ocean circulation modelling
        Vectorizable codes that ran well on vector machines
        RS/6000-320 gave about 4% of Cray Y/MP performance

    The STREAM benchmark was developed as a proxy for the computational kernels in my ocean models

  • ^ Performance

    STREAM Design

    STREAM is designed to measure sustainable memory bandwidth for contiguous, long-vector memory accesses

    Four kernels are tested:
        COPY:  a(i) = b(i)
        SCALE: a(i) = q*b(i)
        ADD:   a(i) = b(i) + c(i)
        TRIAD: a(i) = b(i) + q*c(i)

    Arrays should all be much larger than caches
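As a rough illustration, the four kernels can be written out as plain loops, together with the byte accounting used to turn a measured time into a bandwidth. This is a sketch in Python rather than the benchmark's FORTRAN or C source. It assumes 8-byte elements and counts one transfer per array operand (16 bytes per iteration for COPY and SCALE, 24 for ADD and TRIAD). The names `stream_kernels` and `bandwidth_gb_per_s` are hypothetical, and interpreted Python loops are far too slow to measure real memory bandwidth.

```python
from array import array

WORD = 8  # bytes per double-precision element

# Bytes counted per loop iteration: one transfer per array operand.
BYTES_PER_ITERATION = {
    "COPY": 2 * WORD,    # read b, write a
    "SCALE": 2 * WORD,   # read b, write a
    "ADD": 3 * WORD,     # read b and c, write a
    "TRIAD": 3 * WORD,   # read b and c, write a
}

def stream_kernels(n=1000, q=3.0):
    """Run the four kernels once, in the form shown on the slide."""
    a = array("d", [1.0] * n)
    b = array("d", [2.0] * n)
    c = array("d", [0.5] * n)
    for i in range(n):           # COPY
        a[i] = b[i]
    for i in range(n):           # SCALE
        a[i] = q * b[i]
    for i in range(n):           # ADD
        a[i] = b[i] + c[i]
    for i in range(n):           # TRIAD
        a[i] = b[i] + q * c[i]
    return a

def bandwidth_gb_per_s(kernel, n, seconds):
    """Convert a kernel's run time into GB/s (1 GB = 1e9 bytes)."""
    return BYTES_PER_ITERATION[kernel] * n / seconds / 1e9
```

For example, a TRIAD pass over 10 million elements that takes 0.1 s corresponds to `bandwidth_gb_per_s("TRIAD", 10_000_000, 0.1)`, which is about 2.4 GB/s.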

  • ^ Performance

    STREAM Design (cont’d)

    The code is set up with dependencies between the kernels to try to prevent excessive optimization

    Recent versions (in FORTRAN and C) have self-contained validation code

    Recent versions (in FORTRAN and C) have OpenMP directives for parallelization
        Code is NUMA-friendly with “first touch” memory placement

    An MPI version is also available (FORTRAN)

  • ^ Performance

    STREAM History

    STREAM was maintained at my FTP site at the University of Delaware from 1991-1996

    STREAM web/ftp sites moved to the University of Virginia in 1996 when I moved to Industry (SGI)

    Current run rules have several publication classes
        “standard” – no source code changes to kernels
        “tuned” – modified source code
        “MPI” – uses MPI source code
        “32-bit” – uses 32-bit data items and arithmetic

  • ^ Performance

    Is STREAM Useful?

    Yes – if its limitations are properly understood

    STREAM is intended as one component of a performance model that includes both processor and memory timings
        E.g., T_total = T_cpu + T_memory

    Such models are intrinsically application dependent
        Often strongly dependent on cache size

  • ^ Performance

    The Big Question

    Q: How should one think about composite figures of merit based on such a collection of low-level measurements?

    A: Composite Figures of Merit must be based on “time” rather than “rate”
        i.e., weighted harmonic means of rates

    Why?
        Combining “rates” in any other way fails to have a “Law of Diminishing Returns”
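The diminishing-returns argument can be seen with a small numerical sketch. The rates and weights below are hypothetical: two components with equal work, where one component is then made 100 times faster. The weighted harmonic mean adds up the time each component contributes, so it saturates. The arithmetic mean keeps growing without bound.

```python
def weighted_harmonic_mean(rates, weights):
    """Combine rates by summing the time each contributes: sum(w) / sum(w / r)."""
    return sum(weights) / sum(w / r for r, w in zip(rates, weights))

def weighted_arithmetic_mean(rates, weights):
    """Combine rates directly; shown only for contrast."""
    return sum(w * r for r, w in zip(rates, weights)) / sum(weights)

weights = [1.0, 1.0]       # equal work in each component (hypothetical)
baseline = [10.0, 10.0]    # two component rates, arbitrary units
boosted = [1000.0, 10.0]   # first component made 100x faster

print(weighted_harmonic_mean(baseline, weights))    # 10.0
print(weighted_harmonic_mean(boosted, weights))     # about 19.8, never reaches 20
print(weighted_arithmetic_mean(boosted, weights))   # 505.0
```

However fast the first component becomes, the harmonic mean here cannot exceed 20, because the second component's time remains in the total.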

  • ^ Performance

    Does Peak GFLOPS predict SPECfp_rate2000?

    [Figure: scatter plot “SPECfp_rate2000 vs Peak MFLOPS”; axes: SPECfp_rate2000/cpu (0 to 30) and Peak MFLOPS (0 to 8000)]

  • ^ Performance

    Does Sustained Memory Bandwidth predict SPECfp_rate2000?

    [Figure: scatter plot “SPECfp_rate2000 vs Sustained BW”; axes: SPECfp_rate2000/cpu (0 to 30) and GB/s per CPU (0 to 7)]

  • ^ Performance

    A Simple Composite Model

    Assume the time to solution is composed of a compute time inversely proportional to peak GFLOPS plus a memory transfer time inversely proportional to sustained memory bandwidth

    Assume “x Bytes/FLOP” to get:

    \[
    \text{``Balanced'' GFLOPS} =
      \frac{1\ \text{``Effective FP op''}}
           {\dfrac{1\ \text{FP op}}{\text{Peak GFLOPS}} + \dfrac{x\ \text{Bytes}}{\text{Sustained GB/s}}}
    \]
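A minimal sketch of this model: the time per effective floating-point operation is 1/(peak GFLOPS) for compute plus x/(sustained GB/s) for memory traffic, and the “balanced” rate is the reciprocal of that sum. The function name and the input values are hypothetical and are not taken from the slides.

```python
def balanced_gflops(peak_gflops, sustained_gb_per_s, bytes_per_flop):
    """Balanced GFLOPS = 1 / (1/peak + x/bandwidth), with x in Bytes/FLOP."""
    compute_time = 1.0 / peak_gflops                    # ns per FP op
    memory_time = bytes_per_flop / sustained_gb_per_s   # ns per FP op
    return 1.0 / (compute_time + memory_time)

# Hypothetical machine: 4 GFLOPS peak, 2 GB/s sustained, x = 1 Byte/FLOP.
print(balanced_gflops(4.0, 2.0, 1.0))   # about 1.33
# Doubling the peak rate alone helps only a little.
print(balanced_gflops(8.0, 2.0, 1.0))   # 1.6
```

With these inputs the memory term dominates, so doubling peak GFLOPS raises the balanced rate by only about 20%.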

  • ^ Performance

    Does this Revised Metric predict SPECfp_rate2000?

    [Figure: scatter plot “Optimized SPECfp_rate2000 Estimates”; axes: SPECfp_rate2000/cpu (0 to 30) and Estimated Rate/cpu (0 to 30)]

  • ^ Performance

    Statistical Metrics

    [Figure: bar chart of R squared and (Std Error)/Mean, on a 0 to 1 scale, for three predictors: Peak GFLOPS, SWIM BW, Optimal]

  • ^ Performance

    What about other applications?

    Effectiveness of caches varies by application area

    Requirements for interconnect performance vary by application area
        Some apps are short-message dominated
        Some apps are long-message dominated

    Composite models can be tuned to specific application areas – if app properties known

  • ^ Performance

    Caches Reduce BW demand

    [Figure: chart truncated in the transcript; only the tick values 2.00, 10.0, and 20.0 survive]

  • ^ Performance

    An Example Model tuned for CFD

    Analyze applications and pick reasonable values:

    \[
    \text{``Balanced'' GFLOPS} =
      \frac{1\ \text{``Effective FP op''}}
           {\dfrac{1\ \text{FP op}}{\text{LINPACK GFLOPS}} + \dfrac{2\ \text{Bytes}}{\text{STREAM GB/s}} + \dfrac{0.1\ \text{Bytes}}{\text{Network GB/s}}}
    \]

    Two cases tested:
        Assume long messages (network BW tracks PTRANS)
        Assume short messages (network BW tracks GUPS)

    The relative time contributions will quickly identify applications that are poorly balanced for the target workload
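The relative time contributions can be read off the three terms in the denominator: 1 FP op per LINPACK GFLOPS, 2 Bytes per STREAM GB/s, and 0.1 Bytes per Network GB/s. The sketch below uses those coefficients from the slide. The function name and the machine numbers in the example are hypothetical.

```python
def cfd_time_breakdown(linpack_gflops, stream_gb_per_s, network_gb_per_s,
                       mem_bytes_per_flop=2.0, net_bytes_per_flop=0.1):
    """Return (balanced GFLOPS, fraction of time spent in each component)."""
    times = {
        "compute": 1.0 / linpack_gflops,
        "memory": mem_bytes_per_flop / stream_gb_per_s,
        "interconnect": net_bytes_per_flop / network_gb_per_s,
    }
    total = sum(times.values())   # ns per effective FP op
    shares = {name: t / total for name, t in times.items()}
    return 1.0 / total, shares

# Hypothetical system: 4 GFLOPS LINPACK, 4 GB/s STREAM, 0.5 GB/s network.
rate, shares = cfd_time_breakdown(4.0, 4.0, 0.5)
print(round(rate, 3))   # 1.053
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

In this made-up case memory traffic accounts for roughly half the time, which is the kind of imbalance the breakdown is meant to expose.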

  • ^ Performance

    Comparing p655 cluster vs p690 SMP
    Assumes long messages

    [Figure: stacked bar chart of Interconnect Time, Memory Time, and Compute Time (scale 0 to 0.045) for the p655+ cluster and the p690+ SMP]

    p655 cluster: 45 GFLOPS
    p690 SMP: 23 GFLOPS

  • ^ Performance

    Comparing p655 cluster vs p690 SMP
    Assumes short messages

    [Figure: stacked bar chart of Interconnect Time, Memory Time, and Compute Time (scale 0 to 0.08) for the p655+ cluster and the p690+ SMP]

    p655 cluster: 13 GFLOPS
    p690 SMP: 21 GFLOPS

  • ^ Performance

    Summary

    The composite methodology is
        Simple to understand
        Based on the components of the HPC Challenge Benchmark
        Based on a mathematically correct model of performance

    Much work remains on documenting the work requirements of various application areas in relation to the component microbenchmarks