Auto-tuning Sparse Matrix Kernels


TRANSCRIPT

Page 1: Auto-tuning Sparse Matrix Kernels

Parallel Computing Laboratory
EECS, Electrical Engineering and Computer Sciences
BERKELEY PAR LAB

Auto-tuning Sparse Matrix Kernels

Sam Williams [1,2], Richard Vuduc [3], Leonid Oliker [1,2], John Shalf [2], Katherine Yelick [1,2], James Demmel [1,2]

[1] University of California, Berkeley   [2] Lawrence Berkeley National Laboratory   [3] Georgia Institute of Technology

[email protected]

Page 2: Auto-tuning Sparse Matrix Kernels


Motivation

Multicore is the de facto solution for improving peak performance for the next decade.

How do we ensure this applies to sustained performance as well?

Processor architectures are extremely diverse, and compilers can rarely exploit them fully.

We need a HW/SW solution that delivers performance without completely sacrificing productivity.

Page 3: Auto-tuning Sparse Matrix Kernels


Overview

Examine the Sparse Matrix Vector Multiplication (SpMV) kernel.

Present and analyze two threaded & auto-tuned implementations.

Benchmark performance across 4 diverse multicore architectures:
- Intel Xeon (Clovertown)
- AMD Opteron
- Sun Niagara2 (Huron)
- IBM QS20 Cell Blade

We show that:
- Auto-tuning can significantly improve performance
- Cell consistently delivers good performance and efficiency
- Niagara2 delivers good performance and productivity

Page 4: Auto-tuning Sparse Matrix Kernels


Multicore SMPs used

Page 5: Auto-tuning Sparse Matrix Kernels

Multicore SMP Systems

[Figure: block diagrams of the four evaluated systems.
- Intel Clovertown: 2 sockets x 4 Core2 cores; each pair of cores shares a 4MB L2; two front-side buses (10.6 GB/s each) into a chipset with 4x64b controllers driving 667MHz FBDIMMs at 21.3 GB/s read / 10.6 GB/s write.
- AMD Opteron: 2 sockets x 2 cores; 1MB victim cache per core; SRI/crossbar; one 128b memory controller per socket (10.66 GB/s to 667MHz DDR2 DIMMs); HyperTransport links between sockets (4 GB/s each direction).
- Sun Niagara2 (Huron): 8 multithreaded SPARC cores with 8KB L1s; crossbar switch (179 GB/s fill, 90 GB/s write-through) into a 4MB, 16-way shared L2 interleaved across 8x64B banks; 4x128b memory controllers (2 banks each) driving 667MHz FBDIMMs at 42.66 GB/s read / 21.33 GB/s write.
- IBM QS20 Cell Blade: 2 sockets, each with a PPE (512KB L2) plus 8 SPEs (256KB local store and MFC each) on the EIB ring network; 25.6 GB/s to 512MB of XDR DRAM per socket; BIF link between sockets (<20 GB/s in each direction).]

Page 6: Auto-tuning Sparse Matrix Kernels

Multicore SMP Systems (memory hierarchy)

[Figure: the same four system diagrams as Page 5, highlighting that Clovertown, Opteron, and Niagara2 use a conventional cache-based memory hierarchy.]

Page 7: Auto-tuning Sparse Matrix Kernels

Multicore SMP Systems (memory hierarchy)

[Figure: the same four system diagrams, contrasting the conventional cache-based memory hierarchies (Clovertown, Opteron, Niagara2) with Cell's disjoint local-store memory hierarchy.]

Page 8: Auto-tuning Sparse Matrix Kernels

Multicore SMP Systems (memory hierarchy)

[Figure: the same four system diagrams. The cache-based hierarchies are programmed with Cache + Pthreads implementations; the Cell local stores are programmed with Local Store + libspe implementations.]

Page 9: Auto-tuning Sparse Matrix Kernels

Multicore SMP Systems (peak flops)

[Figure: the same four system diagrams, annotated with peak double-precision flop rates: Clovertown 75 Gflop/s, Opteron 17 Gflop/s, Niagara2 11 Gflop/s, Cell Blade PPEs 13 Gflop/s and SPEs 29 Gflop/s.]

Page 10: Auto-tuning Sparse Matrix Kernels

Multicore SMP Systems (peak DRAM bandwidth)

[Figure: the same four system diagrams, annotated with peak DRAM bandwidth: Clovertown 21 GB/s read + 10 GB/s write, Opteron 21 GB/s, Niagara2 42 GB/s read + 21 GB/s write, Cell Blade 51 GB/s.]

Page 11: Auto-tuning Sparse Matrix Kernels

Multicore SMP Systems

[Figure: the same four system diagrams, annotated by memory model: Clovertown and Niagara2 provide Uniform Memory Access, while the dual-socket Opteron and Cell Blade are Non-Uniform Memory Access (NUMA) systems.]

Page 12: Auto-tuning Sparse Matrix Kernels


Arithmetic Intensity

Arithmetic Intensity ~ Total Flops / Total DRAM Bytes

Some HPC kernels have an arithmetic intensity that scales with problem size (increasing temporal locality), but there are many important and interesting kernels that don't.

[Figure: arithmetic intensity spectrum. O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods. O(log N): FFTs. O(N): dense linear algebra (BLAS3), particle methods.]
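As an illustration of the two ends of this spectrum (a worked estimate, not from the slides), assume double precision:

    daxpy (BLAS1), y[i] += a*x[i]:
        2 flops per element vs. ~24 bytes of DRAM traffic (read x[i], read y[i], write y[i])
        arithmetic intensity ~ 2/24 ~ 0.08 flops/byte, independent of problem size -> O(1)

    dense matrix-matrix multiply (BLAS3), blocked for cache:
        2*N^3 flops vs. ~3*8*N^2 bytes of compulsory DRAM traffic
        arithmetic intensity grows roughly linearly with N -> O(N)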

Page 13: Auto-tuning Sparse Matrix Kernels


Auto-tuning

Page 14: Auto-tuning Sparse Matrix Kernels


Auto-tuning

Hand optimizing each architecture/dataset combination is not feasible

Goal: a productive solution for performance portability.

Our auto-tuning approach finds a good solution through a combination of heuristics and exhaustive search:
- A Perl script generates many possible kernels (including SIMD-optimized kernels)
- An auto-tuning benchmark runs the kernels and reports back the best one for the current architecture/dataset/compiler/... (a sketch of this step follows below)
- Performance depends on the optimizations generated
- Heuristics are often desirable when the search space isn't tractable

Auto-tuning has proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI).
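A minimal sketch of the benchmark-and-select step (illustrative only; the kernel variants come from the code generator, and the csr_t layout here is an assumption, not the authors' data structure):

    #include <time.h>

    typedef struct { int nrows; const int *row_ptr, *col; const double *val; } csr_t;
    typedef void (*spmv_fn)(const csr_t *A, const double *x, double *y);

    /* Time each generated kernel variant on the actual matrix and machine,
     * and return the fastest one. */
    spmv_fn pick_best_kernel(spmv_fn *candidates, int ncand,
                             const csr_t *A, const double *x, double *y)
    {
        spmv_fn best = candidates[0];
        double best_sec = 1e300;
        for (int i = 0; i < ncand; i++) {
            candidates[i](A, x, y);                       /* warm-up pass */
            clock_t t0 = clock();
            for (int trial = 0; trial < 10; trial++)
                candidates[i](A, x, y);
            double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
            if (sec < best_sec) { best_sec = sec; best = candidates[i]; }
        }
        return best;
    }

In practice the search space (register block sizes, prefetch distances, formats, ...) is pruned with heuristics before anything like this exhaustive timing loop runs.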

Page 15: Auto-tuning Sparse Matrix Kernels


Sparse Matrix-Vector Multiplication (SpMV)

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

Page 16: Auto-tuning Sparse Matrix Kernels


Sparse Matrix-Vector Multiplication

Sparse matrix:
- Most entries are 0.0
- There is a performance advantage in only storing/operating on the nonzeros
- This requires significant meta data

Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors.

Challenges:
- Difficult to exploit ILP (bad for superscalar) and DLP (bad for SIMD)
- Irregular memory access to the source vector
- Difficult to load balance
- Very low computational intensity (often >6 bytes/flop) -> likely memory bound

[Figure: y = Ax, with the sparse matrix A multiplying the dense source vector x to produce the dense destination vector y.]
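A back-of-the-envelope estimate of the ">6 bytes/flop" figure (illustrative, assuming double-precision CSR with 32-bit column indices):

    per nonzero: 2 flops (one multiply, one add)
                 12 bytes of matrix data (8-byte value + 4-byte column index),
                 plus row pointers, source-vector reads, and the destination write
    -> more than 6 bytes of DRAM traffic per flop

Against the peak numbers on Pages 9-10, these machines can supply only roughly 0.3 bytes per peak flop (Clovertown) up to about 4 bytes per peak flop (Niagara2), so SpMV is bandwidth limited nearly everywhere.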

Page 17: Auto-tuning Sparse Matrix Kernels


Dataset (Matrices)

We pruned the original SPARSITY suite down to 14 matrices; none should fit in cache. Matrix rank ranges from 2K to 1M. We subdivided them into 4 categories:
- Dense: a 2K x 2K dense matrix stored in sparse format
- Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship
- Poorly structured (hodgepodge): Economics, Epidemiology, FEM/Accelerator, Circuit, Webbase
- Extreme aspect ratio (linear programming): LP

Page 18: Auto-tuning Sparse Matrix Kernels


Naïve Serial Implementation

Vanilla C implementation; the matrix is stored in CSR (compressed sparse row).

We explored compiler options, but only the best is presented here.

One x86 core delivers more than 10x the performance of a Niagara2 thread.

[Figure: naive serial SpMV performance (GFlop/s, 0 to 0.6 scale) across the 14-matrix suite plus the median, on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and the IBM Cell Blade (PPE).]
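A minimal sketch of the kind of vanilla CSR kernel this corresponds to (not the authors' exact code):

    /* Compressed Sparse Row: values and column indices stored row by row;
     * row_ptr[r] .. row_ptr[r+1]-1 is the nonzero range of row r. */
    typedef struct {
        int nrows;
        int *row_ptr;      /* length nrows+1 */
        int *col;          /* length nnz     */
        double *val;       /* length nnz     */
    } csr_t;

    /* y = A*x: one streaming pass over the nonzeros, 2 flops per nonzero */
    void spmv_csr(const csr_t *A, const double *restrict x, double *restrict y)
    {
        for (int r = 0; r < A->nrows; r++) {
            double sum = 0.0;
            for (int k = A->row_ptr[r]; k < A->row_ptr[r+1]; k++)
                sum += A->val[k] * x[A->col[k]];   /* indirect, irregular access to x */
            y[r] = sum;
        }
    }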

Page 19: Auto-tuning Sparse Matrix Kernels


Naïve Parallel Implementation

SPMD style: partition the matrix by rows and load balance by nonzeros.

Niagara2 is roughly 2.5x faster than either x86 machine.

[Figure: naive parallel (Pthreads) vs. naive serial SpMV performance (GFlop/s, 0 to 3.0 scale) across the matrix suite on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and the IBM Cell Blade (PPEs). Legend: Naïve, Naïve Pthreads.]
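One way to realize "partition by rows, load balance by nonzeros" (a sketch, not the authors' code): walk the CSR row pointers and cut the row range so each thread gets roughly nnz/nthreads nonzeros.

    /* Assign each of nthreads a contiguous row range [start[t], start[t+1])
     * covering roughly the same number of nonzeros.
     * row_ptr is the CSR row-pointer array, so row_ptr[nrows] == nnz. */
    void partition_rows_by_nnz(const int *row_ptr, int nrows,
                               int nthreads, int *start /* length nthreads+1 */)
    {
        long nnz = row_ptr[nrows];
        int r = 0;
        start[0] = 0;
        for (int t = 1; t < nthreads; t++) {
            long target = nnz * t / nthreads;        /* ideal nonzero count so far */
            while (r < nrows && row_ptr[r] < target)
                r++;
            start[t] = r;
        }
        start[nthreads] = nrows;
    }

Each Pthread then runs the serial CSR kernel over its own row range; no synchronization is needed because every thread writes a disjoint slice of y.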

Page 20: Auto-tuning Sparse Matrix Kernels


Naïve Parallel Implementation (same data, annotated for scaling)

SPMD style: partition by rows, load balance by nonzeros; Niagara2 is roughly 2.5x the x86 machines.

[Figure: the same charts as Page 19, annotated with parallel scaling: Clovertown 8x cores = 1.9x performance; Opteron 4x cores = 1.5x performance; Niagara2 64x threads = 41x performance; Cell PPEs 4x threads = 3.4x performance.]

Page 21: Auto-tuning Sparse Matrix Kernels


Naïve Parallel Implementation (same data, as fractions of peak)

SPMD style: partition by rows, load balance by nonzeros; Niagara2 is roughly 2.5x the x86 machines.

[Figure: the same charts as Page 19, annotated with sustained fractions of peak: Clovertown 1.4% of peak flops / 29% of bandwidth; Opteron 4% of peak flops / 20% of bandwidth; Niagara2 25% of peak flops / 39% of bandwidth; Cell PPEs 2.7% of peak flops / 4% of bandwidth.]

Page 22: Auto-tuning Sparse Matrix Kernels


Auto-tuned Performance (+NUMA & SW Prefetching)

Use first touch or libnuma to exploit NUMA; this also includes process affinity.

Tag prefetches with a temporal-locality hint.

Auto-tune: search for the optimal prefetch distances.

[Figure: SpMV performance (GFlop/s, 0 to 3.0 scale) across the matrix suite on Clovertown, Opteron, Niagara2 (Huron), and the Cell Blade (PPEs), broken down by optimization level: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching.]
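A sketch of how these two optimizations show up in code (illustrative; PF_DIST and the locality hint are exactly the parameters the tuner searches over, and __builtin_prefetch is GCC's prefetch builtin):

    /* NUMA: under a first-touch page-placement policy, each thread initializes
     * (touches) its own partition of val[], col[], and the vectors so those
     * pages land in its local memory; threads are then pinned to cores
     * (e.g., with pthread_setaffinity_np or libnuma) so they stay local. */

    /* CSR inner loop with software prefetching of the streamed matrix data.
     * PF_DIST is the prefetch distance in nonzeros; the third argument of
     * __builtin_prefetch is the temporal-locality hint mentioned above. */
    #define PF_DIST 64

    static double row_dot_prefetch(const int *col, const double *val,
                                   int begin, int end, const double *x)
    {
        double sum = 0.0;
        for (int k = begin; k < end; k++) {
            __builtin_prefetch(&val[k + PF_DIST], 0, 0);
            __builtin_prefetch(&col[k + PF_DIST], 0, 0);
            sum += val[k] * x[col[k]];
        }
        return sum;
    }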

Page 23: Auto-tuning Sparse Matrix Kernels


Auto-tuned Performance (+Matrix Compression)

If the kernel is memory bound, the only hope is minimizing memory traffic, so we heuristically compress the parallelized matrix to minimize it.

Options include register blocking, index size, format, etc. The block kernels are implemented with SSE.

The benefit of prefetching is hidden by the requirement of register blocking.

[Figure: SpMV performance (GFlop/s, 0 to 7.0 scale) on Clovertown, Opteron, Niagara2 (Huron), and the Cell Blade (PPEs), broken down by optimization level: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Compression.]
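A sketch of what register blocking does to the kernel (2x2 BCSR shown; the actual tuner searches block dimensions, 16- vs 32-bit indices, and formats, and implements the block multiplies with SSE):

    /* Block CSR with 2x2 register blocks: one column index per block instead
     * of one per nonzero, and a fully unrolled block multiply that keeps the
     * two x values and two partial sums in registers.  Explicit zeros are
     * filled in, so the tuner trades added flops against index compression. */
    typedef struct {
        int nbrows;          /* number of 2-row block rows          */
        int *brow_ptr;       /* length nbrows+1                     */
        int *bcol;           /* starting column of each 2x2 block   */
        double *bval;        /* 4 values per block, row-major       */
    } bcsr2x2_t;

    void spmv_bcsr_2x2(const bcsr2x2_t *A, const double *x, double *y)
    {
        for (int br = 0; br < A->nbrows; br++) {
            double y0 = 0.0, y1 = 0.0;
            for (int b = A->brow_ptr[br]; b < A->brow_ptr[br+1]; b++) {
                const double *v = &A->bval[4*b];
                int c = A->bcol[b];
                double x0 = x[c], x1 = x[c+1];
                y0 += v[0]*x0 + v[1]*x1;
                y1 += v[2]*x0 + v[3]*x1;
            }
            y[2*br]   = y0;
            y[2*br+1] = y1;
        }
    }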

Page 24: Auto-tuning Sparse Matrix Kernels


Auto-tuned Performance (+Cache/TLB Blocking)

Reorganize the matrix to maximize locality of source-vector accesses.

[Figure: SpMV performance (GFlop/s, 0 to 7.0 scale) on Clovertown, Opteron, Niagara2 (Huron), and the Cell Blade (PPEs), broken down by optimization level: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Compression, +Cache/TLB Blocking.]
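One common way to realize this reorganization (a sketch of the idea, not the authors' implementation): split the matrix into column blocks so each block's slice of the source vector fits in cache and within the TLB's reach, then do one pass per block.

    /* Conceptual cache blocking: the matrix is stored as nblocks column-blocked
     * sub-matrices, each in CSR with block-local column indices.  While
     * processing block j only x[j*block_cols ...] is touched, so the source
     * vector's working set stays cache/TLB resident. */
    typedef struct { int nrows; int *row_ptr; int *col; double *val; } csr_t;

    void spmv_cache_blocked(const csr_t *blocks, int nblocks, int block_cols,
                            const double *x, double *y, int nrows)
    {
        for (int r = 0; r < nrows; r++) y[r] = 0.0;
        for (int j = 0; j < nblocks; j++) {
            const csr_t *A = &blocks[j];
            const double *xj = &x[j * block_cols];      /* this block's x slice */
            for (int r = 0; r < A->nrows; r++) {
                double sum = 0.0;
                for (int k = A->row_ptr[r]; k < A->row_ptr[r+1]; k++)
                    sum += A->val[k] * xj[A->col[k]];
                y[r] += sum;
            }
        }
    }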

Page 25: Auto-tuning Sparse Matrix Kernels


Auto-tuned Performance(+DIMMs, Firmware, Padding)

Clovertown was already fully populated with DIMMs

Gave Opteron as many DIMMs as Clovertown

Firmware update for Niagara2 Array padding to avoid inter-

thread conflict misses

PPE’s use ~1/3 of Cell chip area

[Figure: SpMV performance (GFlop/s, 0 to 7.0 scale) on Clovertown, Opteron, Niagara2 (Huron), and the Cell Blade (PPEs), broken down by optimization level: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Compression, +Cache/TLB Blocking, +More DIMMs (Opteron) / FW fix and array padding (Niagara2), etc.]
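A sketch of the array-padding idea (illustrative): when many threads each keep a per-thread buffer at the same power-of-two stride, the buffers can map to the same cache sets and evict each other; padding each buffer by an extra cache line breaks that stride.

    #include <stdlib.h>

    #define LINE 64   /* assumed cache-line size in bytes */

    /* Allocate nthreads per-thread buffers of n doubles each, rounding every
     * buffer up to a whole number of lines and adding one extra line, so
     * consecutive buffers never share a line and are less likely to sit at
     * identical power-of-two offsets that alias to the same cache sets. */
    double **alloc_padded_buffers(int nthreads, int n)
    {
        size_t bytes  = (size_t)n * sizeof(double);
        size_t padded = ((bytes + LINE - 1) / LINE + 1) * LINE;
        char   *pool  = malloc((size_t)nthreads * padded);
        double **buf  = malloc((size_t)nthreads * sizeof *buf);
        for (int t = 0; t < nthreads; t++)
            buf[t] = (double *)(pool + (size_t)t * padded);
        return buf;
    }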

Page 26: Auto-tuning Sparse Matrix Kernels


Auto-tuned Performance (+DIMMs, Firmware, Padding), as fractions of peak

[Figure: the same charts as Page 25, annotated with sustained fractions of peak after full tuning: Clovertown 4% of peak flops / 52% of bandwidth; Opteron 20% of peak flops / 65% of bandwidth; Niagara2 54% of peak flops / 57% of bandwidth; Cell PPEs 10% of peak flops / 10% of bandwidth.]

Page 27: Auto-tuning Sparse Matrix Kernels


Auto-tuned Performance (+Cell/SPE version)

We wrote a double-precision Cell/SPE version: DMA, local-store blocked, NUMA aware, etc.

Only 2x1 and larger BCOO blocks are supported, and only the SpMV-proper routine changed.

It is about 12x faster (median) than using the PPEs alone.

[Figure: SpMV performance (GFlop/s, 0 to 12.0 scale) on Clovertown, Opteron, Niagara2 (Huron), and the Cell Blade (now using the SPEs), broken down by optimization level: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Compression, +Cache/TLB Blocking, +More DIMMs / FW fix / array padding.]
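A plain-C sketch of the 2x1 blocked coordinate (BCOO) kernel the SPE version is built around (the real code runs on the SPEs out of their 256KB local stores, with the MFCs DMA-ing in blocked pieces of the matrix and the needed x segments; that DMA machinery is omitted here):

    /* 2x1 BCOO: each entry is a (block row, column) coordinate for a
     * 2-row x 1-column block holding two values.  Coordinates make every
     * DMA-ed chunk self-describing, which suits the disjoint local stores. */
    typedef struct {
        int nblocks;
        int *brow;       /* block-row index, in units of 2 rows */
        int *bcol;       /* column index                        */
        double *val;     /* 2 values per block                  */
    } bcoo2x1_t;

    /* y must hold zeros (or the running partial sums) before the call */
    void spmv_bcoo_2x1(const bcoo2x1_t *A, const double *x, double *y)
    {
        for (int b = 0; b < A->nblocks; b++) {
            double xv = x[A->bcol[b]];
            y[2*A->brow[b]]     += A->val[2*b]     * xv;
            y[2*A->brow[b] + 1] += A->val[2*b + 1] * xv;
        }
    }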

Page 28: Auto-tuning Sparse Matrix Kernels


Auto-tuned Performance (+Cell/SPE version), as fractions of peak

[Figure: the same charts as Page 27, annotated with sustained fractions of peak: Clovertown 4% of peak flops / 52% of bandwidth; Opteron 20% of peak flops / 65% of bandwidth; Niagara2 54% of peak flops / 57% of bandwidth; Cell SPEs 40% of peak flops / 92% of bandwidth.]

Page 29: Auto-tuning Sparse Matrix Kernels


Auto-tuned Performance (how much did double precision and 2x1 blocking hurt?)

We model faster cores by commenting out the inner kernel calls while still performing all the DMAs, and we enabled 1x1 BCOO.

The result is only about a 16% improvement, suggesting SpMV on Cell is limited by DMA/bandwidth rather than by double-precision throughput or the 2x1 blocking restriction.

[Figure: the same charts as Page 27, with an additional "+better Cell implementation" series modeling faster SPEs (kernel calls commented out) with 1x1 BCOO enabled.]

Page 30: Auto-tuning Sparse Matrix Kernels


Speedup from Auto-tuning: Median & (max)

[Figure: the fully tuned charts from Page 27, annotated with the median (and maximum) speedup from auto-tuning over the naive parallel implementation on each system: 1.3x (2.9x), 26x (34x), 3.9x (4.4x), and 1.6x (2.7x); the 26x reflects the Cell Blade, where tuning includes the move from the PPEs to the SPEs.]

Page 31: Auto-tuning Sparse Matrix Kernels


Summary

Page 32: Auto-tuning Sparse Matrix Kernels


Aggregate Performance (Fully optimized)

Cell consistently delivers the best full-system performance, although Niagara2 delivers nearly comparable per-socket performance.

The dual-core Opteron delivers far better performance (bandwidth) than Clovertown: Clovertown has far too little effective FSB bandwidth.

Huron has far more bandwidth than it can exploit (too much latency, too few cores).

[Figure: SpMV (median) GFlop/s vs. number of cores (1 to 16), fully optimized, for Opteron, Clovertown, Niagara2 (Huron), and the Cell Blade.]

Page 33: Auto-tuning Sparse Matrix Kernels


Parallel Efficiency (average performance per thread, fully optimized)

Parallel efficiency is aggregate MFlop/s divided by the number of cores.

Niagara2 and Cell showed very good multicore scaling; Clovertown showed very poor multicore scaling on both applications.

For SpMV, the Opteron and Clovertown showed good multisocket scaling.

[Figure: SpMV (median) GFlop/s per core vs. number of cores (1 to 16), fully optimized, for Opteron, Clovertown, Niagara2 (Huron), and the Cell Blade.]

Page 34: Auto-tuning Sparse Matrix Kernels


Power Efficiency (Fully Optimized)

We used a digital power meter to measure sustained power under load, and calculate power efficiency as sustained performance / sustained power.

All cache-based machines delivered similar power efficiency.

FBDIMMs (~12W each) drive up sustained power: 8 DIMMs on the Clovertown system (total of ~330W) and 16 DIMMs on the Niagara2 machine (total of ~450W).

[Figure: SpMV (median) power efficiency in MFlop/s per watt (scale 0 to 24) for Clovertown, Opteron, Niagara2 (Huron), and the Cell Blade.]

Page 35: Auto-tuning Sparse Matrix Kernels


Productivity

Niagara2 required significantly less work to deliver good performance.

Cache-based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune).

Page 36: Auto-tuning Sparse Matrix Kernels


Summary

Paradoxically, the most complex/advanced architectures required the most tuning, and delivered the lowest performance.

Niagara2 delivered both very good performance and productivity.

Cell delivered very good performance and efficiency (both processor and power).

Our multicore-specific auto-tuned SpMV implementation significantly outperformed existing parallelization strategies, including an auto-tuned MPI implementation (as discussed at SC07).

Architectural transparency is invaluable in optimizing code

Page 37: Auto-tuning Sparse Matrix Kernels


Acknowledgements

- UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
- Sun Microsystems: Niagara2 donations
- Forschungszentrum Jülich: Cell blade cluster access

Page 38: Auto-tuning Sparse Matrix Kernels


Questions?

Page 39: Auto-tuning Sparse Matrix Kernels


switch to pOSKI