Auto-tuning Sparse Matrix Kernels
Sam Williams1,2
Richard Vuduc3, Leonid Oliker1,2, John Shalf2, Katherine Yelick1,2,
James Demmel1,2
1University of California, Berkeley   2Lawrence Berkeley National Laboratory   3Georgia Institute of Technology
Motivation
Multicore is the de facto solution for improving peak performance for the next decade
How do we ensure this applies to sustained performance as well?
Processor architectures are extremely diverse, and compilers can rarely fully exploit them
We require a HW/SW solution that guarantees performance without completely sacrificing productivity
Overview
Examine the Sparse Matrix-Vector Multiplication (SpMV) kernel
Present and analyze two threaded & auto-tuned implementations
Benchmark performance across 4 diverse multicore architectures:
  Intel Xeon (Clovertown)
  AMD Opteron
  Sun Niagara2 (Huron)
  IBM QS20 Cell Blade
We show that:
  Auto-tuning can significantly improve performance
  Cell consistently delivers good performance and efficiency
  Niagara2 delivers good performance and productivity
Multicore SMPs used
Multicore SMP Systems
[Block diagrams of the four platforms:
Intel Clovertown - two sockets x four Core2 cores, one 4MB shared L2 per core pair, one FSB per socket (10.6 GB/s each) into a chipset with 4x64b controllers driving 667MHz FBDIMMs at 21.3 GB/s read / 10.6 GB/s write.
AMD Opteron - two sockets x two Opteron cores, 1MB victim cache per core, SRI/crossbar per socket, one 128b memory controller per socket to 667MHz DDR2 DIMMs at 10.66 GB/s, sockets coupled by HyperTransport (4 GB/s each direction).
Sun Niagara2 (Huron) - eight multithreaded SPARC cores with 8K L1s, crossbar switch to a 4MB 16-way shared L2 (address interleaving via 8x64B banks; 179 GB/s fill, 90 GB/s writethru), 4x128b memory controllers (2 banks each) to 667MHz FBDIMMs at 42.66 GB/s read / 21.33 GB/s write.
IBM QS20 Cell Blade - two Cell chips, each a PPE (512KB L2) plus eight SPEs (256K local store + MFC each) on the EIB ring network, 25.6 GB/s to 512MB of XDR DRAM per chip, chips coupled by the BIF at <20 GB/s each direction.]
Multicore SMP Systems (memory hierarchy)
[Same four block diagrams; the Clovertown, Opteron, and Niagara2 are annotated "Conventional Cache-based Memory Hierarchy".]
Multicore SMP Systems (memory hierarchy)
[Same diagrams; the Cell's SPEs are additionally annotated "Disjoint Local Store Memory Hierarchy", in contrast to the conventional cache-based hierarchies of the other three systems.]
Multicore SMP Systems (memory hierarchy)
[Same diagrams; the cache-based systems are labeled "Cache + Pthreads implementations", the Cell SPEs "Local Store + libspe implementations".]
Multicore SMP Systems (peak flops)
[Same diagrams, annotated with peak double-precision flop rates: Clovertown 75 Gflop/s, Opteron 17 Gflop/s, Niagara2 11 Gflop/s, Cell PPEs 13 Gflop/s / SPEs 29 Gflop/s.]
Multicore SMP Systems (peak DRAM bandwidth)
[Same diagrams, annotated with peak DRAM bandwidth: Clovertown 21 GB/s read + 10 GB/s write, Opteron 21 GB/s, Niagara2 42 GB/s read + 21 GB/s write, Cell 51 GB/s.]
Multicore SMP Systems
[Same diagrams, annotated by memory-access type: Clovertown and Niagara2 are Uniform Memory Access; Opteron and Cell are Non-Uniform Memory Access.]
Arithmetic Intensity
Arithmetic intensity ~ total flops / total DRAM bytes
Some HPC kernels have an arithmetic intensity that scales with problem size (increasing temporal locality), but there are many important and interesting kernels that don't
[Spectrum of arithmetic intensity:
  O(1): SpMV, BLAS1,2; Stencils (PDEs); Lattice Methods
  O(log(N)): FFTs
  O(N): Dense Linear Algebra (BLAS3); Particle Methods]
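As a rough worked example of why SpMV lands in the O(1) column (our arithmetic; double precision, CSR storage): each stored nonzero costs one multiply-add but moves at least an 8-byte value plus a 4-byte column index from DRAM, so

\[ \mathrm{AI}_{\mathrm{SpMV}} \;\lesssim\; \frac{2\ \text{flops}}{(8+4)\ \text{bytes}} \;\approx\; 0.17\ \text{flops/byte} \quad\Longleftrightarrow\quad \sim 6\ \text{bytes/flop}, \]

independent of the matrix dimension N (and ignoring row pointers and vector traffic, which only lower it further).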
Auto-tuning
Hand-optimizing each architecture/dataset combination is not feasible
Goal: a productive solution for performance portability
Our auto-tuning approach finds a good-performing solution by a combination of heuristics and exhaustive search (a search-loop sketch follows below):
  A Perl script generates many possible kernels (including SIMD-optimized kernels)
  An auto-tuning benchmark examines the kernels and reports back the best one for the current architecture/dataset/compiler/...
  Performance depends on the optimizations generated
  Heuristics are often desirable when the search space isn't tractable
Proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI)
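A minimal sketch of the search half of such an auto-tuner, under our assumptions: the two variants here are trivial stand-ins for generated kernels, and a real tuner would also sweep data-structure parameters, but the time-and-pick-best skeleton is the same.

#include <stddef.h>
#include <time.h>

typedef void (*spmv_kernel_t)(int n, const double *x, double *y);

/* Stand-ins for code-generated variants (hypothetical; a real tuner
 * would load dozens of generated kernels). */
static void variant_plain(int n, const double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] = 2.0 * x[i];
}
static void variant_unroll2(int n, const double *x, double *y) {
    for (int i = 0; i + 1 < n; i += 2) { y[i] = 2.0*x[i]; y[i+1] = 2.0*x[i+1]; }
    if (n & 1) y[n-1] = 2.0 * x[n-1];
}

/* Time several repetitions of one variant. */
static double seconds(spmv_kernel_t k, int n, const double *x, double *y) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int trial = 0; trial < 10; trial++) k(n, x, y);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

/* Exhaustive search: benchmark every variant, keep the fastest for
 * this architecture/dataset/compiler combination. */
spmv_kernel_t pick_best_variant(int n, const double *x, double *y) {
    spmv_kernel_t variants[] = { variant_plain, variant_unroll2 };
    size_t nv = sizeof variants / sizeof variants[0];
    spmv_kernel_t best = variants[0];
    double best_time = 1e300;
    for (size_t i = 0; i < nv; i++) {
        double t = seconds(variants[i], n, x, y);
        if (t < best_time) { best_time = t; best = variants[i]; }
    }
    return best;
}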
Sparse Matrix-Vector Multiplication (SpMV)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Sparse Matrix-Vector Multiplication
Sparse matrix: most entries are 0.0, so there is a performance advantage in only storing/operating on the nonzeros, but this requires significant metadata
Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors
Challenges:
  Difficult to exploit ILP (bad for superscalar) and DLP (bad for SIMD)
  Irregular memory access to the source vector
  Difficult to load balance
  Very low computational intensity (often >6 bytes/flop) = likely memory bound
Dataset (Matrices)
Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache
Subdivided them into 4 categories; rank ranges from 2K to 1M:
  Dense: 2K x 2K dense matrix stored in sparse format
  Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship
  Poorly structured (hodgepodge): Economics, Epidemiology, FEM/Accelerator, Circuit, Webbase
  Extreme aspect ratio (linear programming): LP
Naïve Serial Implementation
Vanilla C implementation; matrix stored in CSR (compressed sparse row; a sketch follows below)
Explored compiler options, but only the best is presented here
An x86 core delivers >10x the performance of a Niagara2 thread
[Figure: naïve serial SpMV performance (GFlop/s, 0-0.6) across the matrix suite on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade (PPE).]
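For reference, a minimal sketch of CSR storage and the naïve kernel it implies (identifiers are ours, not the benchmark's): one multiply-add per stored nonzero, with indirect, irregular reads of x.

/* CSR (compressed sparse row) storage. */
typedef struct {
    int     nrows;
    int    *row_ptr;   /* nrows+1 offsets into cols/vals        */
    int    *cols;      /* column index of each stored nonzero   */
    double *vals;      /* value of each stored nonzero          */
} csr_t;

/* y = A*x: 2 flops per nonzero, indirect gather from x via cols[]. */
void spmv_csr(const csr_t *A, const double *x, double *y) {
    for (int i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->vals[k] * x[A->cols[k]];
        y[i] = sum;
    }
}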
Naïve Parallel Implementation
SPMD style: partition by rows, load balancing by nonzeros (a partitioning sketch follows after the annotated figures below)
Niagara2 achieves ~2.5x the performance of the x86 machines
[Figure: naïve parallel (Pthreads) vs. naïve serial SpMV performance (GFlop/s, 0-3.0) on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade (PPEs).]
[Same figure, annotated with scaling relative to the naïve serial baseline: Clovertown: 8x cores = 1.9x performance; Opteron: 4x cores = 1.5x performance; Niagara2: 64x threads = 41x performance; Cell PPEs: 4x threads = 3.4x performance.]
[Same figure, annotated with sustained efficiency of the naïve parallel code: Clovertown 1.4% of peak flops / 29% of bandwidth; Opteron 4% / 20%; Niagara2 25% / 39%; Cell PPEs 2.7% / 4%.]
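The load-balancing step can be made concrete with a small sketch (our own names; the tuned code also reorders and blocks the matrix): given the CSR row_ptr array, choose contiguous per-thread row ranges holding roughly equal nonzero counts.

/* Assign each of nthreads a contiguous row range with ~nnz/nthreads
 * nonzeros; row_ptr is the CSR row-pointer array from the sketch above. */
void partition_by_nnz(const int *row_ptr, int nrows, int nthreads,
                      int *row_start /* nthreads+1 entries */) {
    int nnz = row_ptr[nrows];
    int r = 0;
    row_start[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        long target = (long)nnz * t / nthreads;  /* ideal cumulative nnz */
        while (r < nrows && row_ptr[r] < target) r++;
        row_start[t] = r;
    }
    row_start[nthreads] = nrows;
}
/* Thread t then runs the CSR kernel over rows [row_start[t], row_start[t+1]). */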
Auto-tuned Performance (+NUMA & SW Prefetching)
Use first touch or libnuma to exploit NUMA; also includes process affinity
Tag prefetches with temporal-locality hints
Auto-tune: search for the optimal prefetch distances (a prefetch sketch follows below)
[Figure: SpMV performance (GFlop/s, 0-3.0) with +NUMA/Affinity and +SW Prefetching layered over the naïve Pthreads and naïve serial bars, on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade (PPEs).]
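A sketch of the prefetching variant, reusing the csr_t layout from the earlier sketch; PF_DIST is our placeholder for the distance the auto-tuner searches over, and GCC's __builtin_prefetch stands in for the ISA-specific prefetch instructions the generator emits.

#define PF_DIST 64   /* in nonzeros; the search tries several values */

void spmv_csr_prefetch(const csr_t *A, const double *x, double *y,
                       int row_begin, int row_end) {
    /* NUMA note: under first-touch, each thread should also have been
     * the first to write its slice of vals/cols so the pages are local. */
    for (int i = row_begin; i < row_end; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++) {
            /* non-temporal hint (locality 0): the matrix is streamed once;
             * prefetching a little past the end of the arrays is harmless */
            __builtin_prefetch(&A->vals[k + PF_DIST], 0, 0);
            __builtin_prefetch(&A->cols[k + PF_DIST], 0, 0);
            sum += A->vals[k] * x[A->cols[k]];
        }
        y[i] = sum;
    }
}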
Auto-tuned Performance (+Matrix Compression)
If memory bound, the only hope is minimizing memory traffic
Heuristically compress the parallelized matrix to minimize it; options: register blocking, index size, format, etc.
Implemented with SSE
The benefit of prefetching is hidden by the requirement of register blocking (a register-blocking sketch follows below)
[Figure: SpMV performance (GFlop/s, 0-7.0) with +Compression added to the previous optimizations, on the four platforms.]
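Register blocking is the core of the compression step; a sketch of one variant (2x2 BCSR, our naming and layout) shows the effect: one column index now covers four stored values (explicit zeros may be filled in), and two elements of y stay in registers.

/* 2x2 block CSR: 4 doubles per block, row-major within the block.
 * Assumes the row count was padded to a multiple of 2. */
typedef struct {
    int     nbrows;     /* number of 2-row block rows               */
    int    *brow_ptr;   /* nbrows+1 offsets into bcols/vals         */
    int    *bcols;      /* block-column index (units of 2 columns)  */
    double *vals;       /* 4 doubles per block                      */
} bcsr22_t;

void spmv_bcsr22(const bcsr22_t *A, const double *x, double *y) {
    for (int I = 0; I < A->nbrows; I++) {
        double y0 = 0.0, y1 = 0.0;               /* register-resident */
        for (int k = A->brow_ptr[I]; k < A->brow_ptr[I + 1]; k++) {
            const double *b = &A->vals[4 * k];
            double x0 = x[2 * A->bcols[k]];
            double x1 = x[2 * A->bcols[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * I] = y0;
        y[2 * I + 1] = y1;
    }
}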
Auto-tuned Performance (+Cache/TLB Blocking)
Reorganize the matrix to maximize locality of source-vector accesses (a blocking sketch follows below)
[Figure: SpMV performance (GFlop/s, 0-7.0) with +Cache/TLB Blocking added, on the four platforms.]
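A sketch of one way to realize cache/TLB blocking, under our assumptions: store the matrix as column slabs (reusing csr_t from above) so each pass touches only a bounded window of the source vector, which then stays cache- and TLB-resident.

/* Column-slab decomposition: slab[s] holds only the nonzeros whose
 * columns fall in [s*col_width, (s+1)*col_width), with column indices
 * stored relative to the slab. col_width is a tuned parameter. */
typedef struct {
    int    nslabs;
    int    col_width;
    csr_t *slab;
} blocked_csr_t;

void spmv_cache_blocked(const blocked_csr_t *A, const double *x, double *y) {
    /* y must be zeroed before the first slab; each slab accumulates. */
    for (int s = 0; s < A->nslabs; s++) {
        const double *xs = x + (long)s * A->col_width; /* slab's x window */
        const csr_t *B = &A->slab[s];
        for (int i = 0; i < B->nrows; i++) {
            double sum = 0.0;
            for (int k = B->row_ptr[i]; k < B->row_ptr[i + 1]; k++)
                sum += B->vals[k] * xs[B->cols[k]];    /* slab-local cols */
            y[i] += sum;
        }
    }
}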
Auto-tuned Performance (+DIMMs, Firmware, Padding)
Clovertown was already fully populated with DIMMs
Gave the Opteron as many DIMMs as the Clovertown
Firmware update for Niagara2
Array padding to avoid inter-thread conflict misses (a padding sketch follows after the annotated figure below)
PPEs use ~1/3 of the Cell chip area
[Figure: SpMV performance (GFlop/s, 0-7.0) with +More DIMMs (Opteron), +FW fix, array padding (N2), etc. added, on the four platforms.]
[Same figure, annotated with sustained efficiency after these optimizations: Clovertown 4% of peak flops / 52% of bandwidth; Opteron 20% / 65%; Niagara2 54% / 57%; Cell PPEs 10% / 10%.]
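A sketch of the array-padding idea, with assumed parameters: stagger each thread's buffer base by a different cache-line offset so that identically sized, identically aligned partitions do not all map to the same cache sets.

#include <stdlib.h>

#define CACHE_LINE 64

/* Over-allocate and shift the base pointer by thread_id cache lines.
 * The caller must free (char *)p - thread_id*CACHE_LINE, not p itself. */
double *alloc_padded(size_t nelems, int thread_id) {
    size_t pad = (size_t)thread_id * CACHE_LINE;
    char *raw = malloc(nelems * sizeof(double) + pad + CACHE_LINE);
    return raw ? (double *)(raw + pad) : NULL;
}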
Auto-tuned Performance (+Cell/SPE version)
Wrote a double-precision Cell/SPE version: DMA, local-store blocked, NUMA aware, etc. (a double-buffering sketch follows after the annotated figure below)
Only 2x1 and larger BCOO blocks
Only the SpMV-proper routine changed
About 12x faster (median) than using the PPEs alone
[Figure: SpMV performance (GFlop/s, 0-12.0) including the SPE implementation; panels: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (SPEs).]
[Same figure, annotated with fully tuned efficiency: Clovertown 4% of peak flops / 52% of bandwidth; Opteron 20% / 65%; Niagara2 54% / 57%; Cell SPEs 40% / 92%.]
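A shape-simplified sketch of the local-store strategy: stream the nonzeros of one long row through two local buffers, overlapping the transfer of the next chunk with compute on the current one. memcpy stands in for the Cell SDK's asynchronous MFC DMA calls; real SPE code issues the next transfer, computes, then waits on a DMA tag, so transfer and compute genuinely overlap.

#include <string.h>

#define CHUNK 1024                      /* nonzeros per local-store chunk */

void spmv_stream(const double *vals, const int *cols, int nnz,
                 const double *x, double *sum) {
    static double lv[2][CHUNK];         /* double-buffered values  */
    static int    lc[2][CHUNK];         /* double-buffered indices */
    int buf = 0;
    int n0 = nnz < CHUNK ? nnz : CHUNK;
    memcpy(lv[0], vals, n0 * sizeof *vals);        /* "DMA" first chunk */
    memcpy(lc[0], cols, n0 * sizeof *cols);
    for (int base = 0; base < nnz; base += CHUNK) {
        int n = nnz - base < CHUNK ? nnz - base : CHUNK;
        int next = base + CHUNK;
        int nn = nnz - next;
        if (nn > 0) {                   /* fetch next chunk into buf^1 */
            if (nn > CHUNK) nn = CHUNK;
            memcpy(lv[buf ^ 1], vals + next, nn * sizeof *vals);
            memcpy(lc[buf ^ 1], cols + next, nn * sizeof *cols);
        }
        for (int k = 0; k < n; k++)     /* compute on current chunk */
            *sum += lv[buf][k] * x[lc[buf][k]];
        buf ^= 1;
    }
}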
Auto-tuned Performance (how much did double precision and 2x1 blocking hurt?)
Model faster cores by commenting out the inner kernel calls, but still performing all DMAs
Enabled 1x1 BCOO
~16% improvement
[Figure: same chart with a "+better Cell implementation" increment added for the SPEs.]
Speedup from Auto-tuning: Median & (max)
[Figure: same four panels; median (max) speedup from auto-tuning over the naïve parallel baseline: Clovertown 1.6x (2.7x), Opteron 3.9x (4.4x), Niagara2 1.3x (2.9x), Cell 26x (34x).]
Summary
Aggregate Performance (fully optimized)
Cell consistently delivers the best full-system performance, although Niagara2 delivers nearly comparable per-socket performance
The dual-core Opteron delivers far better performance (bandwidth) than Clovertown; Clovertown has far too little effective FSB bandwidth
Huron has far more bandwidth than it can exploit (too much latency, too few cores)
[Figure: median SpMV GFlop/s (0-7) vs. number of cores (1-16) for Opteron, Clovertown, Niagara2 (Huron), and Cell Blade.]
Parallel Efficiency (average performance per thread, fully optimized)
Aggregate Mflop/s / #cores
Niagara2 & Cell showed very good multicore scaling
Clovertown showed very poor multicore scaling on both applications
For SpMV, Opteron and Clovertown showed good multisocket scaling
[Figure: median SpMV GFlop/s per core (0-1.0) vs. number of cores (1-16) for Opteron, Clovertown, Niagara2 (Huron), and Cell Blade.]
Power Efficiency (fully optimized)
Used a digital power meter to measure sustained power under load
Calculate power efficiency as: sustained performance / sustained power
All cache-based machines delivered similar power efficiency
FBDIMMs (~12W each) drive up sustained power:
  8 DIMMs on Clovertown (total of ~330W)
  16 DIMMs on the N2 machine (total of ~450W)
[Figure: median SpMV MFlop/s per watt (0-24) for Clovertown, Opteron, Niagara2 (Huron), and Cell Blade.]
Productivity
Niagara2 required significantly less work to deliver good performance.
The cache-based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune)
Summary
Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance
Niagara2 delivered both very good performance and productivity
Cell delivered very good performance and efficiency (processor and power)
Our multicore-specific auto-tuned SpMV implementation significantly outperformed existing parallelization strategies, including an auto-tuned MPI implementation (as discussed @SC07)
Architectural transparency is invaluable in optimizing code
Acknowledgements
UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
Sun Microsystems: Niagara2 donations
Forschungszentrum Jülich: Cell blade cluster access
Questions?
switch to pOSKI