Autotuning Sparse Matrix and Structured Grid Kernels


TRANSCRIPT

Page 1: Autotuning Sparse Matrix and  Structured Grid Kernels

Parallel Computing Laboratory (Berkeley Par Lab), EECS — Electrical Engineering and Computer Sciences, UC Berkeley

Autotuning Sparse Matrix and Structured Grid Kernels

Samuel Williams (1,2), Richard Vuduc (3), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2), Jonathan Carter (2), David Patterson (1,2)

(1) University of California, Berkeley  (2) Lawrence Berkeley National Laboratory  (3) Georgia Institute of Technology

[email protected]

Page 2: Autotuning Sparse Matrix and  Structured Grid Kernels

Overview

- Multicore is the de facto performance solution for the next decade.
- Examined the Sparse Matrix-Vector Multiplication (SpMV) kernel: an important, common, memory-intensive HPC kernel.
  - Present 2 autotuned threaded implementations.
  - Compare with the leading MPI implementation (PETSc) using an autotuned serial kernel (OSKI).
- Examined the Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD) application: a memory-intensive HPC application (structured grid).
  - Present 2 autotuned threaded implementations.
- Benchmarked performance across 4 diverse multicore architectures:
  - Intel Xeon (Clovertown)
  - AMD Opteron
  - Sun Niagara2 (Huron)
  - IBM QS20 Cell Blade
- We show that Cell consistently delivers good performance and efficiency, and that Niagara2 delivers good performance and productivity.

Page 3: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuning

Outline: Autotuning | Multicore SMPs | HPC Kernels | SpMV | LBMHD | Summary

Page 4: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuning

- Hand-optimizing each architecture/dataset combination is not feasible.
- Autotuning finds a good performance solution by heuristics or exhaustive search (a sketch of the search step follows this slide):
  - A Perl script generates many possible kernels, including SSE-optimized kernels.
  - An autotuning benchmark examines the kernels and reports back with the best one for the current architecture/dataset/compiler/…
- Performance depends on the optimizations generated.
- Heuristics are often desirable when the search space isn't tractable.
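As an illustration of that search step, here is a minimal sketch of an exhaustive-search harness that times a set of generated kernel variants on the actual matrix and keeps the fastest. The kernel signature, timing routine, and trial count are assumptions for the example, not the authors' actual harness:

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical signature shared by every generated SpMV kernel variant. */
typedef void (*spmv_kernel_t)(int nrows, const int *rowptr, const int *col,
                              const double *val, const double *x, double *y);

static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

/* Exhaustive search: time every generated variant on the actual matrix
 * (and this compiler/architecture) and report the fastest one. */
int pick_best_variant(spmv_kernel_t *variant, int nvariants,
                      int nrows, const int *rowptr, const int *col,
                      const double *val, const double *x, double *y)
{
    int best = 0;
    double best_time = 1e30;
    for (int v = 0; v < nvariants; v++) {
        double t0 = wall_seconds();
        for (int trial = 0; trial < 10; trial++)    /* amortize timer noise */
            variant[v](nrows, rowptr, col, val, x, y);
        double t = (wall_seconds() - t0) / 10.0;
        if (t < best_time) { best_time = t; best = v; }
    }
    printf("best variant: %d (%.3e s per SpMV)\n", best, best_time);
    return best;
}
```

A heuristic search would simply prune the list of variants before this loop instead of timing them all.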

Page 5: Autotuning Sparse Matrix and  Structured Grid Kernels

Multicore SMPs Used

Outline: Autotuning | Multicore SMPs | HPC Kernels | SpMV | LBMHD | Summary

Page 6: Autotuning Sparse Matrix and  Structured Grid Kernels

Multicore SMP Systems

[Figure: block diagrams of the four systems studied.]
- Intel Clovertown: 2 sockets x 4 Core2 cores, two 4MB shared L2 caches per socket, front-side buses (10.6 GB/s each) into a chipset with 4x64b controllers driving 667MHz FBDIMMs (21.3 GB/s read, 10.6 GB/s write).
- AMD Opteron: 2 sockets x 2 cores, 1MB victim cache per core, SRI/crossbar, and a 128b memory controller per socket driving 667MHz DDR2 DIMMs at 10.66 GB/s; sockets linked by HyperTransport (4 GB/s each direction).
- Sun Niagara2 (Huron): 8 multithreaded SPARC cores with 8KB L1s, a crossbar switch into a 16-way 4MB shared L2 (address interleaving via 8x64B banks), and 4x128b memory controllers (2 banks each) driving 667MHz FBDIMMs (42.66 GB/s read, 21.33 GB/s write).
- IBM QS20 Cell Blade: 2 Cell processors, each with one PPE (512KB L2) and 8 SPEs (256KB local store + MFC each) on the EIB ring network, each attached to 512MB of XDR DRAM at 25.6 GB/s; the two sockets are coupled through the BIF (<20 GB/s each direction).

Page 7: Autotuning Sparse Matrix and  Structured Grid Kernels

Multicore SMP Systems (memory hierarchy)

[Same system diagrams as above.] Clovertown, Opteron, and Niagara2 use a conventional cache-based memory hierarchy.

Page 8: Autotuning Sparse Matrix and  Structured Grid Kernels

Multicore SMP Systems (memory hierarchy)

[Same system diagrams.] In contrast, the Cell SPEs use a disjoint local-store memory hierarchy.

Page 9: Autotuning Sparse Matrix and  Structured Grid Kernels

Multicore SMP Systems (memory hierarchy)

[Same system diagrams.] The cache-based machines use cache + Pthreads implementations; the Cell SPEs use local store + libspe implementations.

Page 10: Autotuning Sparse Matrix and  Structured Grid Kernels

Multicore SMP Systems (peak flops)

[Same system diagrams, annotated with peak double-precision flop rates:]
- Intel Clovertown: 75 Gflop/s (TLP + ILP + SIMD)
- AMD Opteron: 17 Gflop/s (TLP + ILP)
- Sun Niagara2 (Huron): 11 Gflop/s (TLP)
- IBM Cell Blade: PPEs 13 Gflop/s (TLP + ILP), SPEs 29 Gflop/s (TLP + ILP + SIMD)

Page 11: Autotuning Sparse Matrix and  Structured Grid Kernels

Multicore SMP Systems (peak DRAM bandwidth)

[Same system diagrams, annotated with peak DRAM bandwidth:]
- Intel Clovertown: 21 GB/s (read), 10 GB/s (write)
- AMD Opteron: 21 GB/s
- Sun Niagara2 (Huron): 42 GB/s (read), 21 GB/s (write)
- IBM Cell Blade: 51 GB/s

Page 12: Autotuning Sparse Matrix and  Structured Grid Kernels

Multicore SMP Systems

[Same system diagrams.] Clovertown and Niagara2 present uniform memory access; the dual-socket Opteron and Cell blade are non-uniform memory access (NUMA) systems.

Page 13: Autotuning Sparse Matrix and  Structured Grid Kernels

HPC Kernels

Outline: Autotuning | Multicore SMPs | HPC Kernels | SpMV | LBMHD | Summary

Page 14: Autotuning Sparse Matrix and  Structured Grid Kernels

Arithmetic Intensity

- Arithmetic intensity ~ total compulsory flops / total compulsory bytes.
- Many HPC kernels have an arithmetic intensity that scales with problem size (increasing temporal locality), but there are many important and interesting kernels that don't.
- Low arithmetic intensity kernels are likely to be memory bound; high arithmetic intensity kernels are likely to be processor bound.
- This metric ignores memory addressing complexity.

Arithmetic intensity spectrum (low to high):
- O(1): SpMV, BLAS1/2; stencils (PDEs); lattice methods
- O(log N): FFTs
- O(N): dense linear algebra (BLAS3); particle methods

Page 15: Autotuning Sparse Matrix and  Structured Grid Kernels

Arithmetic Intensity

[Same slide, with the spectrum annotated:] the low-intensity (memory-bound) end is a "good match for Cell, Niagara2"; the high-intensity (compute-bound) end is a "good match for Clovertown, eDP Cell, …".

Page 16: Autotuning Sparse Matrix and  Structured Grid Kernels

Sparse Matrix-Vector Multiplication (SpMV)

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

Outline: Autotuning | Multicore SMPs | HPC Kernels | SpMV | LBMHD | Summary

Page 17: Autotuning Sparse Matrix and  Structured Grid Kernels

Sparse Matrix-Vector Multiplication

- Sparse matrix: most entries are 0.0, so there is a performance advantage in storing and operating only on the nonzeros — but this requires significant metadata.
- Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors.
- Challenges:
  - Difficult to exploit ILP (bad for superscalar) and DLP (bad for SIMD)
  - Irregular memory access to the source vector
  - Difficult to load balance
  - Very low computational intensity (often more than 6 bytes per flop) — likely memory bound

[Figure: y = Ax with sparse A and dense x, y.]
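A quick sanity check on the ">6 bytes per flop" figure, assuming CSR storage with 8-byte double values and 4-byte column indices and ignoring source-vector and row-pointer traffic:

$$
\frac{\text{compulsory bytes}}{\text{flops}} \;\approx\; \frac{8\,\mathrm{B\ (value)} + 4\,\mathrm{B\ (column\ index)}}{2\ \mathrm{flops\ (multiply + add)}} \;=\; 6\ \mathrm{bytes/flop},
$$

with source-vector and row-pointer traffic only pushing the real ratio higher.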

Page 18: Autotuning Sparse Matrix and  Structured Grid Kernels

Dataset (Matrices)

- Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache.
- Subdivided them into 4 categories; rank ranges from 2K to 1M.

[Figure: spy plots of the 14 matrices.]
- Dense: a 2K x 2K dense matrix stored in sparse format
- Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship
- Poorly structured (hodgepodge): Economics, Epidemiology, FEM/Accelerator, Circuit, webbase
- Extreme aspect ratio (linear programming): LP

Page 19: Autotuning Sparse Matrix and  Structured Grid Kernels

Naïve Serial Implementation

- Vanilla C implementation; matrix stored in CSR (compressed sparse row).
- Explored compiler options, but only the best is presented here.
- An x86 core delivers >10x the performance of a Niagara2 thread.

[Figure: naïve serial SpMV performance (GFlop/s, 0–0.6) across the matrix suite (Dense … LP, plus the median) on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and the IBM Cell Blade (PPE).]
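For reference, a minimal CSR SpMV in the spirit of the naïve serial implementation described here (array names and the use of int indices are illustrative; the tuned kernels differ):

```c
/* y = A*x for A stored in CSR (compressed sparse row).
 * rowptr has nrows+1 entries; col/val hold the nonzeros row by row. */
void spmv_csr(int nrows,
              const int    *rowptr,   /* row start offsets into col/val */
              const int    *col,      /* column index of each nonzero   */
              const double *val,      /* value of each nonzero          */
              const double *x,        /* dense source vector            */
              double       *y)        /* dense destination vector       */
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = rowptr[r]; k < rowptr[r + 1]; k++)
            sum += val[k] * x[col[k]];   /* irregular access to x */
        y[r] = sum;
    }
}
```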

Page 20: Autotuning Sparse Matrix and  Structured Grid Kernels

Naïve Parallel Implementation

- SPMD style: partition by rows, load balance by nonzeros (a partitioning sketch follows this slide).
- Niagara2 achieves ~2.5x the performance of the x86 machines.

[Figure: naïve Pthreads vs. naïve serial SpMV performance (GFlop/s, 0–3.0) across the matrix suite on the four systems; Cell results use the PPEs.]
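A sketch of the row partitioning described above — splitting the rows so that each thread covers roughly an equal share of the nonzeros, using the CSR row pointer. The function name and exact splitting rule are illustrative:

```c
/* Split the rows among nthreads so that each thread covers roughly
 * nnz/nthreads nonzeros (SPMD row partitioning, balanced by nonzeros).
 * row_start must have nthreads+1 entries; thread t then works on rows
 * [row_start[t], row_start[t+1]). */
void partition_rows_by_nnz(int nrows, const int *rowptr,
                           int nthreads, int *row_start)
{
    long long nnz = rowptr[nrows];
    int t = 0;
    row_start[0] = 0;
    for (int r = 0; r < nrows && t + 1 < nthreads; r++) {
        /* close thread t's partition once it has reached its share */
        if ((long long)rowptr[r + 1] * nthreads >= (long long)(t + 1) * nnz)
            row_start[++t] = r + 1;
    }
    while (t < nthreads)          /* any leftover partitions are empty */
        row_start[++t] = nrows;
}
```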

Page 21: Autotuning Sparse Matrix and  Structured Grid Kernels

Naïve Parallel Implementation

[Same slide, annotated with parallel scaling relative to the naïve serial code:]
- Intel Clovertown: 8x cores = 1.9x performance
- AMD Opteron: 4x cores = 1.5x performance
- Sun Niagara2: 64x threads = 41x performance
- IBM Cell (PPEs): 4x threads = 3.4x performance

Page 22: Autotuning Sparse Matrix and  Structured Grid Kernels

Naïve Parallel Implementation

[Same slide, annotated with sustained fractions of peak:]
- Intel Clovertown: 1.4% of peak flops, 29% of bandwidth
- AMD Opteron: 4% of peak flops, 20% of bandwidth
- Sun Niagara2: 25% of peak flops, 39% of bandwidth
- IBM Cell (PPEs): 2.7% of peak flops, 4% of bandwidth

Page 23: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (+NUMA & SW Prefetching)

- Use first touch, or libnuma, to exploit NUMA; also includes process affinity.
- Tag prefetches with temporal locality hints.
- Autotune: search for the optimal prefetch distances.
- A sketch of both ideas follows this slide.

[Figure: SpMV performance (GFlop/s, 0–3.0) across the matrix suite on the four systems, stacking Naïve, Naïve Pthreads, +NUMA/Affinity, and +SW Prefetching.]
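A sketch of the two ideas on this slide, assuming a Pthreads SpMV in which each thread owns a contiguous block of rows: first-touch initialization places each thread's pages in its local memory, and the inner loop issues software prefetches a tunable distance ahead (PF_DIST would be one of the autotuned parameters; all names here are illustrative):

```c
#include <xmmintrin.h>   /* _mm_prefetch */

/* First-touch placement: on NUMA systems a page is mapped near the thread
 * that first writes it, so each thread zeroes the part of the destination
 * vector (and, in practice, fills the matrix arrays) it will later use. */
void first_touch(int r0, int r1, double *y)
{
    for (int r = r0; r < r1; r++)
        y[r] = 0.0;
}

/* SpMV over one thread's rows [r0, r1) with software prefetching
 * PF_DIST elements ahead; the best distance is found by the autotuner. */
void spmv_rows_prefetch(int r0, int r1, const int *rowptr, const int *col,
                        const double *val, const double *x, double *y,
                        int PF_DIST)
{
    for (int r = r0; r < r1; r++) {
        double sum = 0.0;
        for (int k = rowptr[r]; k < rowptr[r + 1]; k++) {
            _mm_prefetch((const char *)&val[k + PF_DIST], _MM_HINT_NTA);
            _mm_prefetch((const char *)&col[k + PF_DIST], _MM_HINT_NTA);
            sum += val[k] * x[col[k]];
        }
        y[r] = sum;
    }
}
```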

Page 24: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (+Matrix Compression)

- If memory bound, the only hope is minimizing memory traffic.
- Heuristically compress the parallelized matrix to minimize it; implemented with SSE.
- The benefit of prefetching is hidden by the requirement of register blocking.
- Options: register blocking, index size, format, etc. (a register-blocking sketch follows this slide).

[Figure: SpMV performance (GFlop/s, 0–7.0) across the matrix suite on the four systems, now adding +Compression to the optimization stack.]
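To illustrate register blocking (one of the compression options listed above), here is a sketch of a 2x2 block-CSR kernel with 16-bit block-column indices. The real autotuner selects the block size, index width, and format per matrix, so the specific choices here are assumptions:

```c
#include <stdint.h>

/* y = A*x for A stored in 2x2 BCSR: one column index per 2x2 block and
 * four consecutive values per block, which shrinks index metadata and
 * exposes ILP/SIMD. nbrows = number of block rows (matrix rows / 2). */
void spmv_bcsr_2x2(int nbrows,
                   const int      *browptr,  /* block-row offsets           */
                   const uint16_t *bcol,     /* block column index (16-bit) */
                   const double   *bval,     /* 4 values per block          */
                   const double   *x,
                   double         *y)
{
    for (int br = 0; br < nbrows; br++) {
        double y0 = 0.0, y1 = 0.0;
        for (int b = browptr[br]; b < browptr[br + 1]; b++) {
            const double *v = &bval[4 * b];
            int c = 2 * bcol[b];
            double x0 = x[c], x1 = x[c + 1];
            y0 += v[0] * x0 + v[1] * x1;
            y1 += v[2] * x0 + v[3] * x1;
        }
        y[2 * br]     = y0;
        y[2 * br + 1] = y1;
    }
}
```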

Page 25: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (+Cache/TLB Blocking)

- Reorganize the matrix to maximize the locality of source-vector accesses.

[Figure: SpMV performance (GFlop/s, 0–7.0) on the four systems, adding +Cache/TLB Blocking to the optimization stack.]

Page 26: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (+DIMMs, Firmware, Padding)

- Clovertown was already fully populated with DIMMs.
- Gave the Opteron as many DIMMs as the Clovertown.
- Firmware update for Niagara2.
- Array padding to avoid inter-thread conflict misses.
- The PPEs use ~1/3 of the Cell chip area.

[Figure: SpMV performance (GFlop/s, 0–7.0) on the four systems, adding "+More DIMMs (Opteron), FW fix, array padding (N2), etc." to the optimization stack.]

Page 27: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (+DIMMs, Firmware, Padding)

[Same slide, annotated with sustained fractions of peak:]
- Intel Clovertown: 4% of peak flops, 52% of bandwidth
- AMD Opteron: 20% of peak flops, 65% of bandwidth
- Sun Niagara2: 54% of peak flops, 57% of bandwidth
- IBM Cell (PPEs): 10% of peak flops, 10% of bandwidth

Page 28: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (+Cell/SPE version)

- Wrote a double-precision Cell/SPE version: DMA, local-store blocked, NUMA aware, etc.
- Only 2x1 and larger BCOO blocks; only the SpMV-proper routine changed.
- About 12x faster (median) than using the PPEs alone.

[Figure: SpMV performance (GFlop/s, 0–12.0) across the matrix suite on the four systems; Cell Blade results now use the SPEs.]

Page 29: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (+Cell/SPE version)

[Same slide, annotated with sustained fractions of peak:]
- Intel Clovertown: 4% of peak flops, 52% of bandwidth
- AMD Opteron: 20% of peak flops, 65% of bandwidth
- Sun Niagara2: 54% of peak flops, 57% of bandwidth
- IBM Cell (SPEs): 40% of peak flops, 92% of bandwidth

Page 30: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (How much did double precision and 2x1 blocking hurt?)

- Model faster cores by commenting out the inner kernel calls, but still performing all DMAs.
- Enabled 1x1 BCOO.
- Result: ~16% improvement.

[Figure: same SpMV chart, adding a "+better Cell implementation" bar for the Cell SPE version.]

Page 31: Autotuning Sparse Matrix and  Structured Grid Kernels

MPI vs. Threads

- On the x86 machines, the autotuned (OSKI) shared-memory MPICH implementation rarely scales beyond 2 threads.
- Still debugging MPI issues on Niagara2, but so far it rarely scales beyond 8 threads.

[Figure: SpMV performance (GFlop/s, 0–7.0) across the matrix suite comparing Naïve Serial, Autotuned MPI, and Autotuned Pthreads on Intel Clovertown, AMD Opteron, and Sun Niagara2 (Huron).]

Page 32: Autotuning Sparse Matrix and  Structured Grid Kernels

Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)

Preliminary results.

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS) (to appear), 2008.

Outline: Autotuning | Multicore SMPs | HPC Kernels | SpMV | LBMHD | Summary

Page 33: Autotuning Sparse Matrix and  Structured Grid Kernels

Lattice Methods

- Structured grid code with a series of time steps; popular in CFD.
- Allows for complex boundary conditions.
- Higher-dimensional phase space: a simplified kinetic model that maintains the macroscopic quantities.
- Distribution functions (e.g., 27 velocities per point in space) are used to reconstruct the macroscopic quantities.
- Significant memory capacity requirements.

[Figure: the 27 lattice velocity directions (numbered 0–26) in 3D, with +X, +Y, +Z axes.]

Page 34: Autotuning Sparse Matrix and  Structured Grid Kernels

LBMHD (general characteristics)

- Plasma turbulence simulation.
- Two distributions: a momentum distribution (27 components) and a magnetic distribution (15 vector components).
- Three macroscopic quantities: density, momentum (vector), and magnetic field (vector).
- Must read 73 doubles and update (write) 79 doubles per point in space.
- Requires about 1300 floating-point operations per point in space — just over 1.0 flops/byte (ideal).
- No temporal locality between points in space within one time step.
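Checking the stated ideal flop:byte ratio from the per-point numbers above:

$$
\frac{\approx 1300\ \mathrm{flops}}{(73 + 79)\ \mathrm{doubles} \times 8\ \mathrm{B/double}} \;=\; \frac{1300}{1216} \;\approx\; 1.07\ \mathrm{flops/byte}.
$$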

Page 35: Autotuning Sparse Matrix and  Structured Grid Kernels

LBMHD (implementation details)

- Data structure choices (a layout sketch follows this slide):
  - Array of structures: lacks spatial locality.
  - Structure of arrays: a huge number of memory streams per thread, but vectorizes well.
- Parallelization:
  - The Fortran version used MPI to communicate between nodes — a bad match for multicore.
  - This version uses pthreads for multicore and MPI for inter-node communication; MPI is not used when autotuning.
- Two problem sizes: 64^3 (~330MB) and 128^3 (~2.5GB).
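A sketch of the two layout choices for the momentum distribution (names and sizes are illustrative; the real code also carries the magnetic components per point):

```c
#define NPTS  (64 * 64 * 64)   /* lattice points, e.g. the 64^3 problem  */
#define NVEL  27               /* momentum-distribution components       */

/* Array of structures: the components of one lattice point sit together,
 * so sweeping a single component across the grid lacks spatial locality. */
struct lattice_aos {
    double f[NVEL];             /* f[0..26] for this one point           */
};

/* Structure of arrays: each component is a contiguous stream over the
 * whole grid -- vectorizes well, at the cost of many (27 plus magnetic)
 * simultaneous memory streams per thread. */
struct lattice_soa {
    double f[NVEL][NPTS];       /* f[v][p]: component v at point p       */
};
```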

Page 36: Autotuning Sparse Matrix and  Structured Grid Kernels

Pthread Implementation

- Not naïve: fully unrolled loops, NUMA-aware, 1D parallelization.
- Always used 8 threads per core on Niagara2.
- The Cell version was not autotuned.

[Figure: LBMHD performance (GFlop/s, 0–8.0) for the 64^3 and 128^3 problems as a function of core count on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and the IBM Cell Blade.]

Page 37: Autotuning Sparse Matrix and  Structured Grid Kernels

Pthread Implementation

[Same slide, annotated with sustained fractions of peak:]
- Intel Clovertown: 4.8% of peak flops, 16% of bandwidth
- AMD Opteron: 14% of peak flops, 17% of bandwidth
- Sun Niagara2: 54% of peak flops, 14% of bandwidth

Page 38: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (+Stencil-aware Padding)

- This lattice method is essentially 79 simultaneous 72-point stencils.
- These can cause conflict misses even with highly associative L1 caches (not to mention the Opteron's 2-way L1).
- Solution: pad each component so that, when accessed with the corresponding stencil (spatial) offset, the components are uniformly distributed in the cache.
- The Cell version was not autotuned.

[Figure: LBMHD performance (GFlop/s, 0–8.0) on the four systems, stacking Naïve+NUMA and +Padding.]

Page 39: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (+Vectorization)

- Each update requires touching ~150 components, each likely to be on a different page, so TLB misses can significantly impact performance.
- Solution: vectorization — fuse the spatial loops, strip-mine into vectors of size VL, and interchange with the phase-dimension loops (see the sketch after this slide).
- Autotune: search for the optimal vector length.
- Significant benefit on some architectures; becomes irrelevant when bandwidth dominates performance.
- The Cell version was not autotuned.

[Figure: LBMHD performance (GFlop/s, 0–8.0) on the four systems, adding +Vectorization to the optimization stack.]
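A schematic of the loop restructuring described above. This is only a sketch: the real update reads 73 and writes 79 components and performs ~1300 flops per point; here a hypothetical update_component stands in for that work:

```c
#define VL     64              /* vector length -- an autotuned parameter */
#define NVEL   27              /* phase-space (velocity) components       */

/* Hypothetical per-component work over a strip of points [p0, p0+len);
 * stands in for the real gather/collision update of LBMHD. */
static void update_component(double *f, int v, int p0, int len, int npoints)
{
    for (int p = p0; p < p0 + len; p++)
        f[(long)v * npoints + p] *= 0.99;    /* placeholder arithmetic */
}

/* Fused spatial loop, strip-mined into vectors of VL points, with the
 * phase-dimension loop interchanged inside the strip: only a few component
 * streams are active at a time, which keeps TLB pressure low. */
void lbmhd_sweep(double *f, int npoints)
{
    for (int p0 = 0; p0 < npoints; p0 += VL) {
        int len = (npoints - p0 < VL) ? (npoints - p0) : VL;
        for (int v = 0; v < NVEL; v++)
            update_component(f, v, p0, len, npoints);
    }
}
```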

Page 40: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (+Explicit Unrolling/Reordering)

- Give the compilers a helping hand with the complex loops.
- Code generator: a Perl script generates all power-of-2 possibilities.
- Autotune: search for the best unrolling and expression of data-level parallelism.
- This is essential when using SIMD intrinsics.
- The Cell version was not autotuned.

[Figure: LBMHD performance (GFlop/s, 0–8.0) on the four systems, adding +Unrolling to the optimization stack.]

Page 41: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (+Software Prefetching)

- Expanded the code generator to insert software prefetches in case the compiler doesn't.
- Autotune over: no prefetch, prefetch 1 cache line ahead, prefetch 1 vector ahead.
- Relatively little benefit for relatively little work.
- The Cell version was not autotuned.

[Figure: LBMHD performance (GFlop/s, 0–8.0) on the four systems, adding +SW Prefetching to the optimization stack.]

Page 42: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (+SIMDization, including non-temporal stores)

- The compilers (gcc & icc) failed at exploiting SIMD, so the code generator was expanded to use SIMD intrinsics (an illustration follows this slide).
- Explicit unrolling/reordering was extremely valuable here.
- Exploited movntpd (non-temporal stores) to minimize memory traffic — the only hope if memory bound.
- Significant benefit for significant work.
- The Cell version was not autotuned.

[Figure: LBMHD performance (GFlop/s, 0–8.0) on the four systems, adding +SIMDization to the optimization stack.]
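A minimal illustration of the x86 intrinsics involved: SSE2 arithmetic on pairs of doubles plus a streaming (movntpd) store that bypasses the cache, since the result will not be read again within the time step. This is a generic multiply-add-and-store loop, not the actual generated LBMHD code:

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_mul_pd, _mm_add_pd, _mm_stream_pd */

/* dst[i] = a[i]*b[i] + c[i], written with non-temporal (movntpd) stores.
 * Assumes 16-byte aligned arrays and n a multiple of 2. */
void fma_stream(double *dst, const double *a, const double *b,
                const double *c, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_load_pd(&a[i]);
        __m128d vb = _mm_load_pd(&b[i]);
        __m128d vc = _mm_load_pd(&c[i]);
        __m128d r  = _mm_add_pd(_mm_mul_pd(va, vb), vc);
        _mm_stream_pd(&dst[i], r);   /* compiles to movntpd: bypasses cache */
    }
    _mm_sfence();   /* make the streaming stores globally visible */
}
```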

Page 43: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (Cell/SPE version)

- First attempt at a Cell implementation: VL, unrolling, and reordering are fixed (not autotuned).
- Exploits DMA and double buffering to load vectors; written straight to SIMD intrinsics.
- Despite the good relative performance, Cell's double-precision implementation severely impairs performance.

[Figure: LBMHD performance (GFlop/s, 0–18.0) for 64^3 and 128^3 as a function of core count on the four systems; the IBM Cell Blade* results are for collision() only.]

Page 44: Autotuning Sparse Matrix and  Structured Grid Kernels

Autotuned Performance (Cell/SPE version)

[Same slide, annotated with sustained fractions of peak:]
- Intel Clovertown: 7.5% of peak flops, 17% of bandwidth
- AMD Opteron: 42% of peak flops, 35% of bandwidth
- Sun Niagara2: 59% of peak flops, 15% of bandwidth
- IBM Cell Blade (collision() only): 57% of peak flops, 33% of bandwidth

Page 45: Autotuning Sparse Matrix and  Structured Grid Kernels

Summary

Outline: Autotuning | Multicore SMPs | HPC Kernels | SpMV | LBMHD | Summary

Page 46: Autotuning Sparse Matrix and  Structured Grid Kernels

Aggregate Performance (fully optimized)

- Cell consistently delivers the best full-system performance; Niagara2 delivers comparable per-socket performance.
- The dual-core Opteron delivers far better performance (bandwidth) than Clovertown, but as the flop:byte ratio increases its performance advantage decreases.
- Huron has far more bandwidth than it can exploit (too much latency, too few cores).
- Clovertown has far too little effective FSB bandwidth.

[Figure: aggregate GFlop/s vs. core count (1–16) for SpMV (median) and LBMHD (64^3) on Opteron, Clovertown, Niagara2 (Huron), and the Cell Blade.]

Page 47: Autotuning Sparse Matrix and  Structured Grid Kernels

Parallel Efficiency (average performance per thread, fully optimized)

- Computed as aggregate Mflop/s divided by the number of cores.
- Niagara2 & Cell show very good multicore scaling; Clovertown showed very poor multicore scaling on both applications.
- For SpMV, Opteron and Clovertown showed good multisocket scaling.
- Clovertown runs into bandwidth limits far short of its theoretical peak, even for LBMHD.
- Opteron lacks the bandwidth for SpMV, and the FP resources to use its bandwidth for LBMHD.

[Figure: GFlop/s per core vs. core count for SpMV (median) and LBMHD (64^3) on the four systems.]

Page 48: Autotuning Sparse Matrix and  Structured Grid Kernels

Power Efficiency (fully optimized)

- Used a digital power meter to measure sustained power under load.
- Power efficiency is calculated as sustained performance / sustained power.
- All cache-based machines delivered similar power efficiency.
- FBDIMMs (~12W each) add substantially to sustained power: 8 DIMMs on Clovertown (total of ~330W), 16 DIMMs on the Niagara2 machine (total of ~450W).

[Figure: MFlop/s per watt for SpMV (median) and LBMHD (64^3) on Clovertown, Opteron, Niagara2 (Huron), and the Cell Blade.]

Page 49: Autotuning Sparse Matrix and  Structured Grid Kernels

Productivity

- Niagara2 required significantly less work to deliver good performance.
- For LBMHD, Clovertown, Opteron, and Cell all required SIMD (which hampers productivity) for best performance.
- Virtually every optimization was required (sooner or later) for Opteron and Cell.
- The cache-based machines required search for some optimizations, while Cell always relied on heuristics.

Page 50: Autotuning Sparse Matrix and  Structured Grid Kernels

Summary

- Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance.
- Niagara2 delivered both very good performance and productivity.
- Cell delivered very good performance and efficiency (processor and power).
- Our multicore-specific autotuned SpMV implementation significantly outperformed an autotuned MPI implementation.
- Our multicore autotuned LBMHD implementation significantly outperformed the already optimized serial implementation.
- Sustainable memory bandwidth is essential even on kernels with moderate computational intensity (flop:byte ratio).
- Architectural transparency is invaluable in optimizing code.

Page 51: Autotuning Sparse Matrix and  Structured Grid Kernels

Acknowledgements

- UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
- Sun Microsystems: Niagara2 access
- Forschungszentrum Jülich: Cell blade cluster access

Page 52: Autotuning Sparse Matrix and  Structured Grid Kernels

Questions?

Page 53: Autotuning Sparse Matrix and  Structured Grid Kernels

Backup Slides