

Benchmarks on BG/L: Parallel and Serial

John A. Gunnels, Mathematical Sciences Dept.

IBM T. J. Watson Research Center

Overview

- Single-node benchmarks
  - Architecture
  - Algorithms
- Linpack
  - Dealing with a bottleneck
  - Communication operations
- Benchmarks of the future

Compute Node: BG/L

- Dual core
- Dual FPU/SIMD
  - Alignment issues
- Three-level cache
  - Pre-fetching
- Non-coherent L1 caches (32 KB, 64-way, round-robin); L2 & L3 caches coherent
- Limited number of outstanding L1 misses

Programming Options: High to Low Level

- Compiler optimization to find SIMD parallelism (see the sketch below)
  - User input for specifying memory alignment and lack of aliasing: alignx assertion, disjoint pragma
- Dual FPU intrinsics ("built-ins")
  - Complex data type used to model a pair of double-precision numbers that occupy a (P, S) register pair
  - Compiler responsible for register allocation and scheduling
- In-line assembly
  - User responsible for instruction selection, register allocation, and scheduling
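To make the first two levels concrete, here is a sketch in C. The __alignx assertion and the disjoint pragma are IBM XL C extensions; the dual-FPU built-in names (__lfpd, __fpadd, __stfpd, operating on double _Complex register pairs) follow the XL C documentation for BG/L, but treat the exact names and this code as an illustrative assumption rather than the tuned library source.

```c
/* Level 1: plain C plus hints.  __alignx() asserts 16-byte alignment and
 * #pragma disjoint asserts that the pointers do not alias, so the compiler
 * is free to generate parallel (SIMD) loads and stores. */
void vadd_hints(int n, const double *x, const double *y, double *z)
{
#pragma disjoint(*x, *z)
#pragma disjoint(*y, *z)
    __alignx(16, x);
    __alignx(16, y);
    __alignx(16, z);
    for (int i = 0; i < n; ++i)
        z[i] = x[i] + y[i];
}

/* Level 2: dual-FPU built-ins.  A double _Complex value models the (P, S)
 * register pair; __lfpd, __fpadd, and __stfpd are the parallel load, add,
 * and store built-ins (assumed names; verify against the compiler release).
 * Assumes n is even and the operands are 16-byte aligned. */
void vadd_builtins(int n, double *x, double *y, double *z)
{
    for (int i = 0; i < n; i += 2)
        __stfpd(&z[i], __fpadd(__lfpd(&x[i]), __lfpd(&y[i])));
}
```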

[Chart: BG/L single-node STREAM performance (444 MHz), 28 July 2003. Bandwidth (MB/s) vs. vector size (8-byte elements) for tuned and out-of-box (OOB) versions of the copy, scale, add, and triad kernels.]

STREAM Performance

- Out-of-box performance is 50-65% of tuned performance
- Lessons learned in tuning will be transferred to the compiler where possible
- Comparison with commodity microprocessors is competitive:

  Machine        Frequency (MHz)   STREAM (MB/s)   FP peak (Mflop/s)   Balance (B/F)
  Intel Xeon     3060              2900            6120                0.474
  BG/L           444               2355            3552                0.663
  BG/L           670               3579            5360                0.668
  AMD Opteron    1800              3600            3600                1.000
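For context, the STREAM kernels measured above are simple vector loops over long arrays; a minimal sketch of two of them in plain C (generic code, not the tuned BG/L version):

```c
/* STREAM-style kernels (sketch).  Reported bandwidth is bytes moved per
 * second: copy moves 16 bytes per element, triad moves 24 bytes per
 * element and performs 2 flops. */
void stream_copy(int n, const double *a, double *c)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i];
}

void stream_triad(int n, double scalar, const double *b,
                  const double *c, double *a)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];
}
```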

DAXPY Bandwidth Utilization

[Chart: memory bandwidth utilization vs. vector size for different implementations of DAXPY (vanilla C, Dual FPU intrinsics, hand-coded assembly). Bandwidth in bytes/cycle vs. vector size in bytes, with reference lines at 16 bytes/cycle (L1 bandwidth) and 5.3 bytes/cycle (L3 bandwidth).]
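The kernel behind these curves is the Level-1 BLAS operation DAXPY, y = a*x + y. The "vanilla" curve corresponds to a plain loop like the sketch below; the intrinsics and assembly versions instead process a (P, S) register pair per iteration.

```c
/* Vanilla DAXPY: y = a*x + y.  Each element needs two 8-byte loads and one
 * 8-byte store (24 bytes) for 2 flops, so at the 16 bytes/cycle L1 limit
 * shown in the chart the loop cannot go faster than ~1.5 cycles/element,
 * and once the vectors fall out to L3 the 5.3 bytes/cycle limit governs. */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```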

Matrix Multiplication: Tiling for Registers (Analysis)

- Latency tolerance (not bandwidth) is the concern
- Take advantage of the register count (a generic register-tiled kernel is sketched below, after the diagram)
- Unroll by a factor of two: 24 register pairs, 32 cycles per unrolled iteration, 15-cycle load-to-use latency (L2 hit)
- Could go to a 3-way unroll if needed: 32 register pairs, 32 cycles per unrolled iteration, 31-cycle load-to-use latency

[Diagram: register-tiling block structure; blocks labeled F1, F2, M1, M2 with dimensions 8 and 16.]
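As a generic illustration of the register-tiling idea (not the BG/L dual-FPU kernel; the 4x4 block size and packed-panel layout here are assumptions for the example), the inner kernel keeps a small block of C in registers while streaming panels of A and B:

```c
/* Register-tiled inner kernel: C(4x4) += A(4xK) * B(Kx4).
 * A is packed so that column k of the panel is contiguous (4 doubles);
 * B is packed so that row k of the panel is contiguous (4 doubles);
 * C is column-major with leading dimension ldc.  The 16 accumulators stay
 * in registers, so each A and B element is loaded once and reused 4 times. */
void dgemm_4x4_kernel(int K, const double *A, const double *B,
                      double *C, int ldc)
{
    double c[4][4] = {{0.0}};
    for (int k = 0; k < K; ++k) {
        const double a0 = A[4*k + 0], a1 = A[4*k + 1],
                     a2 = A[4*k + 2], a3 = A[4*k + 3];
        const double b0 = B[4*k + 0], b1 = B[4*k + 1],
                     b2 = B[4*k + 2], b3 = B[4*k + 3];
        c[0][0] += a0*b0; c[0][1] += a0*b1; c[0][2] += a0*b2; c[0][3] += a0*b3;
        c[1][0] += a1*b0; c[1][1] += a1*b1; c[1][2] += a1*b2; c[1][3] += a1*b3;
        c[2][0] += a2*b0; c[2][1] += a2*b1; c[2][2] += a2*b2; c[2][3] += a2*b3;
        c[3][0] += a3*b0; c[3][1] += a3*b1; c[3][2] += a3*b2; c[3][3] += a3*b3;
    }
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            C[i + j*ldc] += c[i][j];
}
```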

Recursive Data Format

- Mapping 2-D (matrix) to 1-D (RAM): C/Fortran layouts do not map well
- Space-filling curve approximation: recursive tiling (see the sketch after the diagram below)
- Enables streaming/pre-fetching and dual-core “scaling”

[Diagram: recursive tiling hierarchy of register-set blocks, dual-register blocks, L1 cache blocks, and L3 cache blocks.]
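A common way to approximate a space-filling-curve ordering of tiles is Morton (Z-order) indexing; the sketch below illustrates the idea, and is not necessarily the exact format used on BG/L.

```c
#include <stdint.h>

/* Morton (Z-order) index of tile (x, y): interleave the low 16 bits of the
 * two coordinates.  Tiles numbered this way keep 2-D neighbours close
 * together in 1-D memory, which is what the recursive data format needs
 * for streaming/pre-fetching. */
static uint32_t morton2d(uint32_t x, uint32_t y)
{
    uint32_t z = 0;
    for (int b = 0; b < 16; ++b) {
        z |= ((x >> b) & 1u) << (2 * b);        /* x bits in even positions */
        z |= ((y >> b) & 1u) << (2 * b + 1);    /* y bits in odd positions  */
    }
    return z;
}
```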

Dual Core

Why? It’s an effortless way to double your performance.

Dual Core

Why? It exploits the architecture and may allow you to double the performance of your code in some cases/regions.

Single-Node DGEMM Performance at 92% of Peak

[Chart: single-node DGEMM performance (444 MHz), 18 July 2003. Performance (GFlop/s) vs. matrix size N for single-core and dual-core code, plotted against the single-core and dual-core peaks.]

- Near-perfect scalability (1.99x) going from single core to dual core
- Dual-core code delivers 92.27% of peak flops (8 flop/pclk; at 444 MHz the peak is 3.552 Gflop/s, so this is roughly 3.28 Gflop/s)
- Performance (as a fraction of peak) is competitive with that of Power3 and Power4

Performance Scales Linearly with Clock Frequency

[Chart: speed test of STREAM COPY, 25 July 2003. Bandwidth (MB/s) vs. clock frequency (400-640 MHz) and vector size N.]

- Measured performance of DGEMM and STREAM scales linearly with frequency
- DGEMM at 650 MHz delivers 4.79 Gflop/s
- STREAM COPY at 670 MHz delivers 3579 MB/s

[Chart: speed test of DGEMM, 25 July 2003. Performance (GFlop/s) vs. clock frequency (400-640 MHz) and matrix size N (16-1760).]

The Linpack Benchmark

LU Factorization: Brief Review

[Diagram: blocked LU factorization. Regions: the already-factored part, the current block column ("pivot and scale columns"), the DTRSM update of the block row, and the DGEMM update of the trailing matrix.]
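A schematic of the step the diagram illustrates, written as plain C loops (column-major storage, pivot search and row swaps omitted; in a real code the marked sections are calls to the BLAS routines DTRSM and DGEMM):

```c
/* One pass of right-looking blocked LU (sketch).  A is n x n, column-major,
 * leading dimension n; nb is the block size.  Pivoting is omitted here. */
void blocked_lu_sketch(int n, int nb, double *A)
{
    for (int k = 0; k < n; k += nb) {
        int b = (k + nb <= n) ? nb : n - k;

        /* 1. Factor the current block column ("pivot and scale columns"). */
        for (int j = k; j < k + b; ++j) {
            for (int i = j + 1; i < n; ++i)
                A[i + j*n] /= A[j + j*n];                /* scale column j */
            for (int jj = j + 1; jj < k + b; ++jj)
                for (int i = j + 1; i < n; ++i)
                    A[i + jj*n] -= A[i + j*n] * A[j + jj*n];
        }

        /* 2. DTRSM: triangular solve for the block row to the right. */
        for (int j = k + b; j < n; ++j)
            for (int jj = k; jj < k + b; ++jj)
                for (int i = jj + 1; i < k + b; ++i)
                    A[i + j*n] -= A[i + jj*n] * A[jj + j*n];

        /* 3. DGEMM: rank-b update of the trailing submatrix. */
        for (int j = k + b; j < n; ++j)
            for (int kk = k; kk < k + b; ++kk)
                for (int i = k + b; i < n; ++i)
                    A[i + j*n] -= A[i + kk*n] * A[kk + j*n];
    }
}
```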

LINPACK: Problem Mapping

[Diagram: mapping the N x N matrix onto the machine (… 16n repetitions, n repetitions).]

Panel Factorization: Option #1

- Stagger the computations
- Panel factorization (PF) is distributed over relatively few processors
  - May take as long as several DGEMM updates
  - DGEMM load imbalance
- Block size trades balance for speed
- Use collective communication primitives
  - May require no “holes” in the communication fabric

Speed-up Option #2

- Change the data distribution
- Decrease the critical path length
- Consider the communication abilities of the machine
- Complements Option #1
  - Memory size (small favors #2; large favors #1)
  - Memory hierarchy (higher latency favors #1)
- The two options can be used in concert

Communication Routines

- Broadcasts precede the DGEMM update
- Needs to be architecturally aware
  - Multiple “pipes” connect processors
  - Physical-to-logical mapping
- Careful orchestration is required to take advantage of the machine's considerable abilities

Row Broadcast: Mesh

[Animation: a row broadcast spreading across the mesh. One schedule leaves a node receiving on 2 links and sending on 4 (a hot spot); the alternative keeps it at receive 2 / send 3.]

Row Broadcast: Torus

[Animation: the same row broadcast on the torus, using the wrap-around links (“sorry for the fruit salad”).]

Broadcast Bandwidth/Latency

- Bandwidth: 2 bytes/cycle per wire
- Latency: sqrt(p), pipelined (large messages); deposit bit: 3 hops
- Mesh: Recv 2 / Send 3
- Torus: Recv 4 / Send 4 (no “hot spot”), or Recv 2 / Send 2 (red-blue only … again, no bottleneck)
- Pipes (Recv/Send): 1/1 on the mesh; 2/2 on the torus
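To make the "pipelined (large msg.)" point concrete, here is a generic pipelined-broadcast time model (the parameter names are assumptions for illustration, not BG/L measurements). The hop count, roughly sqrt(p) on a mesh, enters only through the pipeline-fill term, so it vanishes relative to the streaming term for large messages.

```c
/* Generic pipelined-broadcast model (illustrative only).  A message of m
 * bytes is split into chunks of c bytes and forwarded hop by hop over h
 * hops, each link moving bw bytes per cycle.  Total time in cycles is the
 * pipeline fill for the first chunk plus streaming the whole message. */
double bcast_cycles(double m, double c, double h, double bw)
{
    return (h - 1.0) * (c / bw) + m / bw;
}
```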

What Else? It’s a(n) …

- FPU Test
- Memory Test
- Power Test
- Torus Test
- Mode Test

Conclusion

- 1.435 TF Linpack: #73 on the TOP500 list (11/2003)
- Limited machine access time made analysis (prediction) more important
- 500 MHz chip
  - 1.507 TF run at 525 MHz demonstrates scaling
  - Would achieve >2 TF at 700 MHz
  - 1 TF even if the machine were used in "true" heater mode

Conclusion

1.4 TF Linpack on BG/L Prototype: Components

[Pie chart: breakdown of the run into components: Scale, Rank1, Gemm, Trsm, BcastA, BcastD, Pack, Unpack, Idamax, pdgemm, FWDPiv, BackPiv, and Waiting. The dominant slice is 81.35%; the remaining labeled slices range from 0.01% to 8.97%.]

Additional Conclusions: Models, Extrapolated Data

- Use models to the extent that the architecture and algorithm are understood
- Extrapolate from small processor sets
- Vary as many (yes) parameters as possible at the same time
  - Consider how they interact and how they don't
- Also remember that instrumentation affects timing
  - Often one can compensate (an incorrect answer results)
- Utilize observed “eccentricities” with caution (MPI_Reduce)

Current Fronts

- HPC Challenge Benchmark Suite: STREAMS, HPL, etc.
- HPCS Productivity Benchmarks
- Math Libraries
- Focused feedback to Toronto: PERCS compiler / persistent optimization
- Linpack algorithm on other machines

Thanks to …

Leonardo Bachega: BLAS-1, performance results

Sid Chatterjee, Xavier Martorell: Coprocessor, BLAS-1

Fred Gustavson, James Sexton: Data structure investigations, design, sanity tests

Thanks to …

Gheorghe Almasi, Phil Heidelberger & Nils Smeds: MPI/communications

Vernon Austel: Data copy routines

Gerry Kopcsay & Jose Moreira: System & machine configuration

Derek Lieber & Martin Ohmacht: Refined memory settings

Everyone else: System software, hardware, & machine time!
