Auto-tuning Sparse Matrix Kernels
Sam Williams1,2
Richard Vuduc3, Leonid Oliker1,2, John Shalf2, Katherine Yelick1,2,
James Demmel1,2
1University of California, Berkeley   2Lawrence Berkeley National Laboratory   3Georgia Institute of Technology
Motivation
Multicore is the de facto solution for improving peak performance for the next decade
How do we ensure this applies to sustained performance as well?
Processor architectures are extremely diverse, and compilers can rarely fully exploit them
We require a HW/SW solution that guarantees performance without completely sacrificing productivity
Overview
Examine the Sparse Matrix-Vector Multiplication (SpMV) kernel
Present and analyze two threaded & auto-tuned implementations
Benchmark performance across 4 diverse multicore architectures:
  Intel Xeon (Clovertown)
  AMD Opteron
  Sun Niagara2 (Huron)
  IBM QS20 Cell Blade
We show that:
  Auto-tuning can significantly improve performance
  Cell consistently delivers good performance and efficiency
  Niagara2 delivers good performance and productivity
Multicore SMPs used
Multicore SMP Systems
[Block diagrams of the four platforms:
Intel Clovertown - two sockets x four Core2 cores, one 4MB shared L2 per core pair, one FSB per socket (10.6 GB/s each) into a chipset with 4x64b controllers driving 667MHz FBDIMMs at 21.3 GB/s read / 10.6 GB/s write.
AMD Opteron - two sockets x two Opteron cores, 1MB victim cache per core, SRI/crossbar per socket, one 128b memory controller per socket to 667MHz DDR2 DIMMs at 10.66 GB/s, sockets coupled by HyperTransport (4 GB/s each direction).
Sun Niagara2 (Huron) - eight multithreaded SPARC cores with 8K L1s, crossbar switch to a 4MB 16-way shared L2 (address interleaving via 8x64B banks; 179 GB/s fill, 90 GB/s writethru), 4x128b memory controllers (2 banks each) to 667MHz FBDIMMs at 42.66 GB/s read / 21.33 GB/s write.
IBM QS20 Cell Blade - two Cell chips, each a PPE (512KB L2) plus eight SPEs (256K local store + MFC each) on the EIB ring network, 25.6 GB/s to 512MB of XDR DRAM per chip, chips coupled by the BIF at <20 GB/s each direction.]
Multicore SMP Systems (memory hierarchy)
[Same four block diagrams; the Clovertown, Opteron, and Niagara2 are annotated "Conventional Cache-based Memory Hierarchy".]
Multicore SMP Systems (memory hierarchy)
[Same diagrams; the Cell's SPEs are additionally annotated "Disjoint Local Store Memory Hierarchy", in contrast to the conventional cache-based hierarchies of the other three systems.]
Multicore SMP Systems (memory hierarchy)
[Same diagrams; the cache-based systems are labeled "Cache + Pthreads implementations", the Cell SPEs "Local Store + libspe implementations".]
Multicore SMP Systems (peak flops)
[Same diagrams, annotated with peak double-precision flop rates: Clovertown 75 Gflop/s, Opteron 17 Gflop/s, Niagara2 11 Gflop/s, Cell PPEs 13 Gflop/s / SPEs 29 Gflop/s.]
Multicore SMP Systems (peak DRAM bandwidth)
[Same diagrams, annotated with peak DRAM bandwidth: Clovertown 21 GB/s read + 10 GB/s write, Opteron 21 GB/s, Niagara2 42 GB/s read + 21 GB/s write, Cell 51 GB/s.]
Multicore SMP Systems
[Same diagrams, annotated by memory-access type: Clovertown and Niagara2 are Uniform Memory Access; Opteron and Cell are Non-Uniform Memory Access.]
Arithmetic Intensity
Arithmetic intensity ~ total flops / total DRAM bytes
Some HPC kernels have an arithmetic intensity that scales with problem size (increasing temporal locality), but there are many important and interesting kernels that don't
[Spectrum of arithmetic intensity:
  O(1): SpMV, BLAS1,2; Stencils (PDEs); Lattice Methods
  O(log(N)): FFTs
  O(N): Dense Linear Algebra (BLAS3); Particle Methods]
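As a rough worked example of why SpMV lands in the O(1) column (our arithmetic; double precision, CSR storage): each stored nonzero costs one multiply-add but moves at least an 8-byte value plus a 4-byte column index from DRAM, so

\[ \mathrm{AI}_{\mathrm{SpMV}} \;\lesssim\; \frac{2\ \text{flops}}{(8+4)\ \text{bytes}} \;\approx\; 0.17\ \text{flops/byte} \quad\Longleftrightarrow\quad \sim 6\ \text{bytes/flop}, \]

independent of the matrix dimension N (and ignoring row pointers and vector traffic, which only lower it further).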
Auto-tuning
Hand-optimizing each architecture/dataset combination is not feasible
Goal: a productive solution for performance portability
Our auto-tuning approach finds a good-performing solution by a combination of heuristics and exhaustive search (a search-loop sketch follows below):
  A Perl script generates many possible kernels (including SIMD-optimized kernels)
  An auto-tuning benchmark examines the kernels and reports back the best one for the current architecture/dataset/compiler/...
  Performance depends on the optimizations generated
  Heuristics are often desirable when the search space isn't tractable
Proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI)
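A minimal sketch of the search half of such an auto-tuner, under our assumptions: the two variants here are trivial stand-ins for generated kernels, and a real tuner would also sweep data-structure parameters, but the time-and-pick-best skeleton is the same.

#include <stddef.h>
#include <time.h>

typedef void (*spmv_kernel_t)(int n, const double *x, double *y);

/* Stand-ins for code-generated variants (hypothetical; a real tuner
 * would load dozens of generated kernels). */
static void variant_plain(int n, const double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] = 2.0 * x[i];
}
static void variant_unroll2(int n, const double *x, double *y) {
    for (int i = 0; i + 1 < n; i += 2) { y[i] = 2.0*x[i]; y[i+1] = 2.0*x[i+1]; }
    if (n & 1) y[n-1] = 2.0 * x[n-1];
}

/* Time several repetitions of one variant. */
static double seconds(spmv_kernel_t k, int n, const double *x, double *y) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int trial = 0; trial < 10; trial++) k(n, x, y);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

/* Exhaustive search: benchmark every variant, keep the fastest for
 * this architecture/dataset/compiler combination. */
spmv_kernel_t pick_best_variant(int n, const double *x, double *y) {
    spmv_kernel_t variants[] = { variant_plain, variant_unroll2 };
    size_t nv = sizeof variants / sizeof variants[0];
    spmv_kernel_t best = variants[0];
    double best_time = 1e300;
    for (size_t i = 0; i < nv; i++) {
        double t = seconds(variants[i], n, x, y);
        if (t < best_time) { best_time = t; best = variants[i]; }
    }
    return best;
}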
Sparse Matrix-Vector Multiplication (SpMV)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Sparse Matrix-Vector Multiplication
Sparse matrix: most entries are 0.0, so there is a performance advantage in only storing/operating on the nonzeros, but this requires significant metadata
Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors
Challenges:
  Difficult to exploit ILP (bad for superscalar) and DLP (bad for SIMD)
  Irregular memory access to the source vector
  Difficult to load balance
  Very low computational intensity (often >6 bytes/flop) = likely memory bound
Dataset (Matrices)
Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache
Subdivided them into 4 categories; rank ranges from 2K to 1M:
  Dense: 2K x 2K dense matrix stored in sparse format
  Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship
  Poorly structured (hodgepodge): Economics, Epidemiology, FEM/Accelerator, Circuit, Webbase
  Extreme aspect ratio (linear programming): LP
Naïve Serial Implementation
Vanilla C implementation; matrix stored in CSR (compressed sparse row; a sketch follows below)
Explored compiler options, but only the best is presented here
An x86 core delivers >10x the performance of a Niagara2 thread
[Figure: naïve serial SpMV performance (GFlop/s, 0-0.6) across the matrix suite on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade (PPE).]
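For reference, a minimal sketch of CSR storage and the naïve kernel it implies (identifiers are ours, not the benchmark's): one multiply-add per stored nonzero, with indirect, irregular reads of x.

/* CSR (compressed sparse row) storage. */
typedef struct {
    int     nrows;
    int    *row_ptr;   /* nrows+1 offsets into cols/vals        */
    int    *cols;      /* column index of each stored nonzero   */
    double *vals;      /* value of each stored nonzero          */
} csr_t;

/* y = A*x: 2 flops per nonzero, indirect gather from x via cols[]. */
void spmv_csr(const csr_t *A, const double *x, double *y) {
    for (int i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->vals[k] * x[A->cols[k]];
        y[i] = sum;
    }
}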
Naïve Parallel Implementation
SPMD style: partition by rows, load balancing by nonzeros (a partitioning sketch follows after the annotated figures below)
Niagara2 achieves ~2.5x the performance of the x86 machines
[Figure: naïve parallel (Pthreads) vs. naïve serial SpMV performance (GFlop/s, 0-3.0) on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade (PPEs).]
[Same figure, annotated with scaling relative to the naïve serial baseline: Clovertown: 8x cores = 1.9x performance; Opteron: 4x cores = 1.5x performance; Niagara2: 64x threads = 41x performance; Cell PPEs: 4x threads = 3.4x performance.]
[Same figure, annotated with sustained efficiency of the naïve parallel code: Clovertown 1.4% of peak flops / 29% of bandwidth; Opteron 4% / 20%; Niagara2 25% / 39%; Cell PPEs 2.7% / 4%.]
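The load-balancing step can be made concrete with a small sketch (our own names; the tuned code also reorders and blocks the matrix): given the CSR row_ptr array, choose contiguous per-thread row ranges holding roughly equal nonzero counts.

/* Assign each of nthreads a contiguous row range with ~nnz/nthreads
 * nonzeros; row_ptr is the CSR row-pointer array from the sketch above. */
void partition_by_nnz(const int *row_ptr, int nrows, int nthreads,
                      int *row_start /* nthreads+1 entries */) {
    int nnz = row_ptr[nrows];
    int r = 0;
    row_start[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        long target = (long)nnz * t / nthreads;  /* ideal cumulative nnz */
        while (r < nrows && row_ptr[r] < target) r++;
        row_start[t] = r;
    }
    row_start[nthreads] = nrows;
}
/* Thread t then runs the CSR kernel over rows [row_start[t], row_start[t+1]). */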
Auto-tuned Performance (+NUMA & SW Prefetching)
Use first touch or libnuma to exploit NUMA; also includes process affinity
Tag prefetches with temporal-locality hints
Auto-tune: search for the optimal prefetch distances (a prefetch sketch follows below)
[Figure: SpMV performance (GFlop/s, 0-3.0) with +NUMA/Affinity and +SW Prefetching layered over the naïve Pthreads and naïve serial bars, on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade (PPEs).]
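A sketch of the prefetching variant, reusing the csr_t layout from the earlier sketch; PF_DIST is our placeholder for the distance the auto-tuner searches over, and GCC's __builtin_prefetch stands in for the ISA-specific prefetch instructions the generator emits.

#define PF_DIST 64   /* in nonzeros; the search tries several values */

void spmv_csr_prefetch(const csr_t *A, const double *x, double *y,
                       int row_begin, int row_end) {
    /* NUMA note: under first-touch, each thread should also have been
     * the first to write its slice of vals/cols so the pages are local. */
    for (int i = row_begin; i < row_end; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++) {
            /* non-temporal hint (locality 0): the matrix is streamed once;
             * prefetching a little past the end of the arrays is harmless */
            __builtin_prefetch(&A->vals[k + PF_DIST], 0, 0);
            __builtin_prefetch(&A->cols[k + PF_DIST], 0, 0);
            sum += A->vals[k] * x[A->cols[k]];
        }
        y[i] = sum;
    }
}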
Auto-tuned Performance (+Matrix Compression)
If memory bound, the only hope is minimizing memory traffic
Heuristically compress the parallelized matrix to minimize it; options: register blocking, index size, format, etc.
Implemented with SSE
The benefit of prefetching is hidden by the requirement of register blocking (a register-blocking sketch follows below)
[Figure: SpMV performance (GFlop/s, 0-7.0) with +Compression added to the previous optimizations, on the four platforms.]
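Register blocking is the core of the compression step; a sketch of one variant (2x2 BCSR, our naming and layout) shows the effect: one column index now covers four stored values (explicit zeros may be filled in), and two elements of y stay in registers.

/* 2x2 block CSR: 4 doubles per block, row-major within the block.
 * Assumes the row count was padded to a multiple of 2. */
typedef struct {
    int     nbrows;     /* number of 2-row block rows               */
    int    *brow_ptr;   /* nbrows+1 offsets into bcols/vals         */
    int    *bcols;      /* block-column index (units of 2 columns)  */
    double *vals;       /* 4 doubles per block                      */
} bcsr22_t;

void spmv_bcsr22(const bcsr22_t *A, const double *x, double *y) {
    for (int I = 0; I < A->nbrows; I++) {
        double y0 = 0.0, y1 = 0.0;               /* register-resident */
        for (int k = A->brow_ptr[I]; k < A->brow_ptr[I + 1]; k++) {
            const double *b = &A->vals[4 * k];
            double x0 = x[2 * A->bcols[k]];
            double x1 = x[2 * A->bcols[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * I] = y0;
        y[2 * I + 1] = y1;
    }
}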
Auto-tuned Performance (+Cache/TLB Blocking)
Reorganize the matrix to maximize locality of source-vector accesses (a blocking sketch follows below)
[Figure: SpMV performance (GFlop/s, 0-7.0) with +Cache/TLB Blocking added, on the four platforms.]
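A sketch of one way to realize cache/TLB blocking, under our assumptions: store the matrix as column slabs (reusing csr_t from above) so each pass touches only a bounded window of the source vector, which then stays cache- and TLB-resident.

/* Column-slab decomposition: slab[s] holds only the nonzeros whose
 * columns fall in [s*col_width, (s+1)*col_width), with column indices
 * stored relative to the slab. col_width is a tuned parameter. */
typedef struct {
    int    nslabs;
    int    col_width;
    csr_t *slab;
} blocked_csr_t;

void spmv_cache_blocked(const blocked_csr_t *A, const double *x, double *y) {
    /* y must be zeroed before the first slab; each slab accumulates. */
    for (int s = 0; s < A->nslabs; s++) {
        const double *xs = x + (long)s * A->col_width; /* slab's x window */
        const csr_t *B = &A->slab[s];
        for (int i = 0; i < B->nrows; i++) {
            double sum = 0.0;
            for (int k = B->row_ptr[i]; k < B->row_ptr[i + 1]; k++)
                sum += B->vals[k] * xs[B->cols[k]];    /* slab-local cols */
            y[i] += sum;
        }
    }
}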
Auto-tuned Performance (+DIMMs, Firmware, Padding)
Clovertown was already fully populated with DIMMs
Gave the Opteron as many DIMMs as the Clovertown
Firmware update for Niagara2
Array padding to avoid inter-thread conflict misses (a padding sketch follows after the annotated figure below)
PPEs use ~1/3 of the Cell chip area
[Figure: SpMV performance (GFlop/s, 0-7.0) with +More DIMMs (Opteron), +FW fix, array padding (N2), etc. added, on the four platforms.]
[Same figure, annotated with sustained efficiency after these optimizations: Clovertown 4% of peak flops / 52% of bandwidth; Opteron 20% / 65%; Niagara2 54% / 57%; Cell PPEs 10% / 10%.]
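A sketch of the array-padding idea, with assumed parameters: stagger each thread's buffer base by a different cache-line offset so that identically sized, identically aligned partitions do not all map to the same cache sets.

#include <stdlib.h>

#define CACHE_LINE 64

/* Over-allocate and shift the base pointer by thread_id cache lines.
 * The caller must free (char *)p - thread_id*CACHE_LINE, not p itself. */
double *alloc_padded(size_t nelems, int thread_id) {
    size_t pad = (size_t)thread_id * CACHE_LINE;
    char *raw = malloc(nelems * sizeof(double) + pad + CACHE_LINE);
    return raw ? (double *)(raw + pad) : NULL;
}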
Auto-tuned Performance (+Cell/SPE version)
Wrote a double-precision Cell/SPE version: DMA, local-store blocked, NUMA aware, etc. (a double-buffering sketch follows after the annotated figure below)
Only 2x1 and larger BCOO blocks
Only the SpMV-proper routine changed
About 12x faster (median) than using the PPEs alone
[Figure: SpMV performance (GFlop/s, 0-12.0) including the SPE implementation; panels: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (SPEs).]
[Same figure, annotated with fully tuned efficiency: Clovertown 4% of peak flops / 52% of bandwidth; Opteron 20% / 65%; Niagara2 54% / 57%; Cell SPEs 40% / 92%.]
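A shape-simplified sketch of the local-store strategy: stream the nonzeros of one long row through two local buffers, overlapping the transfer of the next chunk with compute on the current one. memcpy stands in for the Cell SDK's asynchronous MFC DMA calls; real SPE code issues the next transfer, computes, then waits on a DMA tag, so transfer and compute genuinely overlap.

#include <string.h>

#define CHUNK 1024                      /* nonzeros per local-store chunk */

void spmv_stream(const double *vals, const int *cols, int nnz,
                 const double *x, double *sum) {
    static double lv[2][CHUNK];         /* double-buffered values  */
    static int    lc[2][CHUNK];         /* double-buffered indices */
    int buf = 0;
    int n0 = nnz < CHUNK ? nnz : CHUNK;
    memcpy(lv[0], vals, n0 * sizeof *vals);        /* "DMA" first chunk */
    memcpy(lc[0], cols, n0 * sizeof *cols);
    for (int base = 0; base < nnz; base += CHUNK) {
        int n = nnz - base < CHUNK ? nnz - base : CHUNK;
        int next = base + CHUNK;
        int nn = nnz - next;
        if (nn > 0) {                   /* fetch next chunk into buf^1 */
            if (nn > CHUNK) nn = CHUNK;
            memcpy(lv[buf ^ 1], vals + next, nn * sizeof *vals);
            memcpy(lc[buf ^ 1], cols + next, nn * sizeof *cols);
        }
        for (int k = 0; k < n; k++)     /* compute on current chunk */
            *sum += lv[buf][k] * x[lc[buf][k]];
        buf ^= 1;
    }
}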
Auto-tuned Performance (how much did double precision and 2x1 blocking hurt?)
Model faster cores by commenting out the inner kernel calls, but still performing all DMAs
Enabled 1x1 BCOO
~16% improvement
[Figure: same chart with a "+better Cell implementation" increment added for the SPEs.]
Speedup from Auto-tuning: Median & (max)
[Figure: same four panels; median (max) speedup from auto-tuning over the naïve parallel baseline: Clovertown 1.6x (2.7x), Opteron 3.9x (4.4x), Niagara2 1.3x (2.9x), Cell 26x (34x).]
Summary
Aggregate Performance (fully optimized)
Cell consistently delivers the best full-system performance, although Niagara2 delivers nearly comparable per-socket performance
The dual-core Opteron delivers far better performance (bandwidth) than Clovertown; Clovertown has far too little effective FSB bandwidth
Huron has far more bandwidth than it can exploit (too much latency, too few cores)
[Figure: median SpMV GFlop/s (0-7) vs. number of cores (1-16) for Opteron, Clovertown, Niagara2 (Huron), and Cell Blade.]
Parallel Efficiency (average performance per thread, fully optimized)
Aggregate Mflop/s / #cores
Niagara2 & Cell showed very good multicore scaling
Clovertown showed very poor multicore scaling on both applications
For SpMV, Opteron and Clovertown showed good multisocket scaling
[Figure: median SpMV GFlop/s per core (0-1.0) vs. number of cores (1-16) for Opteron, Clovertown, Niagara2 (Huron), and Cell Blade.]
Power Efficiency (fully optimized)
Used a digital power meter to measure sustained power under load
Calculate power efficiency as: sustained performance / sustained power
All cache-based machines delivered similar power efficiency
FBDIMMs (~12W each) drive up sustained power:
  8 DIMMs on Clovertown (total of ~330W)
  16 DIMMs on the N2 machine (total of ~450W)
[Figure: median SpMV MFlop/s per watt (0-24) for Clovertown, Opteron, Niagara2 (Huron), and Cell Blade.]
Productivity
Niagara2 required significantly less work to deliver good performance.
The cache-based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune)
Summary
Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance
Niagara2 delivered both very good performance and productivity
Cell delivered very good performance and efficiency (processor and power)
Our multicore-specific auto-tuned SpMV implementation significantly outperformed existing parallelization strategies, including an auto-tuned MPI implementation (as discussed @SC07)
Architectural transparency is invaluable in optimizing code
Acknowledgements
UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
Sun Microsystems: Niagara2 donations
Forschungszentrum Jülich: Cell blade cluster access
Questions?
switch to pOSKI