Tuning Sparse Matrix Vector Multiplication for multi-core processors
Sam Williams ([email protected])

Page 1: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Tuning Sparse Matrix Vector Multiplication for multi-core processors

Sam Williams

[email protected]

Page 2: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Other Contributors

• Rich Vuduc

• Lenny Oliker

• John Shalf

• Kathy Yelick

• Jim Demmel

Page 3: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Outline

• Introduction

• Machines / Matrices

• Initial performance

• Tuning

• Optimized Performance

Page 4: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Introduction

• Evaluate y=Ax, where x & y are dense vectors, and A is a sparse matrix.

• Sparse implies most elements are zero, and thus do not need to be stored.

• Storing just the nonzeros requires each value plus metadata encoding its coordinates (a reference CSR kernel is sketched below).
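For reference, a minimal untuned kernel for the compressed sparse row (CSR) format might look like the sketch below; the array names are illustrative, not taken from the code behind these slides.

/* Minimal CSR SpMV sketch: y = A*x.  Illustrative names only. */
void spmv_csr(int nrows,
              const int    *row_ptr,  /* nrows+1 offsets into cols/vals */
              const int    *cols,     /* column index of each nonzero   */
              const double *vals,     /* value of each nonzero          */
              const double *x,        /* dense source vector            */
              double       *y)        /* dense destination vector       */
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            sum += vals[k] * x[cols[k]];
        y[r] = sum;
    }
}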

Page 5: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Multi-core trends

• It will be far easier to scale peak GFlop/s (via multi-core) than peak GB/s (more pins / higher frequency)

• With sufficient number of cores, any low computational intensity kernel should be memory bound.

• Thus, the problems with the smallest footprint should run the fastest.

• Thus, tuning via heuristics, rather than exhaustive search, becomes more tractable (a rough flop:byte estimate follows)
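As a rough, hedged estimate (the exact numbers are assumptions, not figures from the slides): SpMV performs 2 flops per nonzero but moves at least 12 bytes per nonzero (an 8-byte double plus a 4-byte column index), for an arithmetic intensity of at most about 0.17 flops/byte before vector and row-pointer traffic is counted. At roughly 21 GB/s of memory bandwidth that caps SpMV near 3.5 GFlop/s, far below these chips' peak flop rates, so bandwidth and data-structure size dominate.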

Page 6: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Which multi-core processors?

Page 7: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Intel Xeon (Clovertown)

• pseudo quad-core / socket

• 4-issue, out of order, super-scalar

• Fully pumped SSE

• 2.33GHz

• 21GB/s, 74.6 GFlop/s

[Block diagram: two sockets, each with two pairs of Xeon cores sharing a 4MB L2; each socket's FSB (10.6 GB/s) connects to the Blackford chipset and fully buffered DRAM, giving 21.3 GB/s aggregate read and 10.6 GB/s write bandwidth.]

Page 8: Tuning Sparse Matrix Vector Multiplication for multi-core processors

AMD Opteron

• Dual-core / socket
• 3-issue, out-of-order, super-scalar
• Half-pumped SSE

• 2.2GHz

• 21GB/s, 17.6 GFlop/s

• Strong NUMA issues

[Block diagram: two sockets, each with two Opteron cores (1MB victim cache each) and an on-chip memory controller / HyperTransport; 10.6 GB/s of DDR2 DRAM bandwidth per socket, 8 GB/s HyperTransport between sockets.]

Page 9: Tuning Sparse Matrix Vector Multiplication for multi-core processors

IBM Cell Blade

• Eight SPEs / socket
• Dual-issue, in-order, VLIW-like
• SIMD-only ISA
• Disjoint local store address space + DMA
• Weak double-precision FPU
• 51.2 GB/s, 29.2 GFlop/s
• Strong NUMA issues

[Block diagram: two Cell processors, each with a PPE (512K L2) and eight SPEs (256K local store + MFC each) on the EIB ring network (<20 GB/s in each direction), 25.6 GB/s XDR DRAM per socket, and a BIF link between sockets.]

Page 10: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Sun Niagara

• Eight cores
• Each core is 4-way multithreaded
• Single-issue, in-order
• Shared, very slow FPU
• 1.0 GHz
• 25.6 GB/s, 0.1 GFlop/s (8 GIPS)

[Block diagram: eight multithreaded UltraSparc cores (8K D$ each) sharing a single FPU, connected by a crossbar switch to a 3MB, 12-way shared L2 (64 GB/s fill, 32 GB/s write-through) and 25.6 GB/s of DDR2 DRAM.]

Page 11: Tuning Sparse Matrix Vector Multiplication for multi-core processors

64b integers on Niagara?

• To provide an interesting benchmark, Niagara was run using 64b integer arithmetic.

• This makes the peak “Gflop/s” in the ballpark of the other architectures.

• Downside: integer multiplies are not pipelined and require 10 cycles; register pressure also increases

• Perhaps a rough approximation to Niagara2 performance and scalability

• Cell's double precision isn't poor enough to necessitate this workaround.

Page 12: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Niagara2

• 1.0 -> 1.4 GHz
• Pipeline redesign
• 2 thread groups/core (2x the ALUs, 2x the threads)
• FPU per core
• FBDIMM (42.6 GB/s read, 21.3 GB/s write)
• Sun's claims:
  – Increasing threads per core from 4 to 8 to deliver up to 64 simultaneous threads in a single Niagara 2 processor, resulting in at least 2x the throughput of the current UltraSPARC T1 processor, all within the same power and thermal envelope
  – The integration of one floating point unit per core, rather than one per processor, to deliver 10x higher throughput on applications with high floating-point content such as scientific, technical, simulation, and modeling programs

Page 13: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Which matrices?

Name             Rows   Columns  Nonzeros (per row)   Description
Dense            2K     2K       4.0M  (2K)           Dense matrix in sparse format
Protein          36K    36K      4.3M  (119)          Protein data bank 1HYS
FEM/Spheres      83K    83K      6.0M  (72)           FEM concentric spheres
FEM/Cantilever   62K    62K      4.0M  (65)           FEM cantilever
Wind Tunnel      218K   218K     11.6M (53)           Pressurized wind tunnel
FEM/Harbor       47K    47K      2.37M (50)           3D CFD of Charleston harbor
QCD              49K    49K      1.90M (39)           Quark propagators (QCD/LGT)
FEM/Ship         141K   141K     3.98M (28)           FEM ship section/detail
Economics        207K   207K     1.27M (6)            Macroeconomic model
Epidemiology     526K   526K     2.1M  (4)            2D Markov model of epidemic
FEM/Accelerator  121K   121K     2.62M (22)           Accelerator cavity design
Circuit          171K   171K     959K  (6)            Motorola circuit simulation
Webbase          1M     1M       3.1M  (3)            Web connectivity matrix
LP               4K     1.1M     11.3M (2825)         Railways set cover constraint matrix

[Spyplots omitted.]

Page 14: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Un-tuned Serial & Parallel Performance

Page 15: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Intel Clovertown

• 8-way parallelism typically delivered only a 66% improvement

[Chart: Intel Clovertown, naïve GFlop/s (0 to 1.4) across the matrix suite and the median; series: 2 Sockets x 4 Cores [naïve] vs. 1 Socket x 1 Core [naïve].]

Page 16: Tuning Sparse Matrix Vector Multiplication for multi-core processors

AMD Opteron

• 4-way parallelism typically delivered only a 40% improvement in performance

[Chart: AMD X2, naïve GFlop/s (0 to 1.0) across the matrix suite and the median; series: Dual Socket x 2 Cores [naïve] vs. 1 Socket x 1 Core [naïve].]

Page 17: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Sun Niagara

• 32-way parallelism typically delivered a 23x performance improvement

[Chart: Sun Niagara, naïve GFlop/s (0 to 1.2) across the matrix suite and the median; series: 8 Cores x 4 Threads [naïve] vs. 1 Core x 1 Thread [naïve].]

Page 18: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Tuning

Page 19: Tuning Sparse Matrix Vector Multiplication for multi-core processors

OSKI & PETSc

• OSKI is a serial auto-tuning library for sparse matrix operations

• Much of the OSKI tuning space is included here.

• For parallelism, it can be included in the PETSc parallel library using a shared memory version of MPICH

• We include these two points as baselines for the x86 machines.

Page 20: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Exploit NUMA

• Partition the matrix into disjoint sub-matrices (thread blocks)

• Use NUMA facilities to pin both thread block and process

• x86 / Linux: sched_setaffinity() (libnuma ran slowly)
• Niagara / Solaris: processor_bind()
• Cell / libnuma: numa_set_membind(), numa_bind()
(A minimal pinning sketch for the Linux case follows.)
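A hedged sketch of the Linux pinning path, assuming one worker thread per thread block; the helper name and core-ID argument are placeholders, not the study's code.

/* Pin the calling thread to one core with sched_setaffinity(), then
 * first-touch the thread block so its pages land on the local NUMA node. */
#define _GNU_SOURCE
#include <sched.h>

static void pin_self_to_core(int core_id)    /* hypothetical helper */
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);                  /* ID mapping is machine-specific (next slide) */
    sched_setaffinity(0, sizeof(set), &set); /* 0 = the calling thread */
}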

Page 21: Tuning Sparse Matrix Vector Multiplication for multi-core processors

• The mapping of the Linux/Solaris processor ID to the physical core/thread ID was unique to each machine
• Important if you want to share a cache or accurately benchmark a single socket/core

Processor ID to physical ID:
• Opteron: core (bit 0), socket (bit 1)
• Clovertown: socket (bit 0), core in socket (bits 2:1)
• Niagara: thread within a core (bits 2:0), core within a socket (bits 5:3), socket (bit 6)

Page 22: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Fast Barriers

• Pthread barriers weren't available on the x86 machines
• Emulate them with mutexes & broadcasts (Sun's example)
• Acceptable performance with a low number of threads
• The Pthread barrier on Solaris doesn't scale well, and the emulation is even slower
• Implement a lock-free barrier (thread 0 sets the others free); see the sketch below
• A similar version on Cell (the PPE sets the SPEs free)
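A hedged sketch of the lock-free idea in C11 atomics; the slides describe only the behavior (thread 0 sets the others free), not this exact code.

#include <stdatomic.h>

typedef struct {
    atomic_int arrived;   /* threads that have reached the barrier; init 0 */
    atomic_int release;   /* generation counter bumped by thread 0; init 0 */
    int        nthreads;
} barrier_t;

static void barrier_wait(barrier_t *b, int tid)
{
    int gen = atomic_load(&b->release);
    atomic_fetch_add(&b->arrived, 1);
    if (tid == 0) {
        /* Thread 0 spins until all threads have arrived, then frees them. */
        while (atomic_load(&b->arrived) < b->nthreads)
            ;
        atomic_store(&b->arrived, 0);
        atomic_fetch_add(&b->release, 1);
    } else {
        /* Everyone else spins on the generation counter thread 0 bumps. */
        while (atomic_load(&b->release) == gen)
            ;
    }
}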

Page 23: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Cell Local Store Blocking

• Heuristic approach
• Applied individually to each thread block
• Allocate ~half the local store to caching 128-byte lines of the source vector and the other half to the destination
• This partitions a thread block into multiple blocked rows
• Each is in turn blocked by marking the unique cache lines touched, expressed as stanzas
• Create a DMA list, and compress the column indices to be cache-block relative
• Limited by the maximum stanza size and the maximum number of stanzas

Page 24: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Cache Blocking

• Local Store blocking can be extended to caches, but you don’t get an explicit list or compressed indices

• Different from standard cache blocking for SpMV (cache lines spanned vs. touched)

• Beneficial on some matrices

Page 25: Tuning Sparse Matrix Vector Multiplication for multi-core processors

TLB Blocking

• Heuristic Approach

• Extend cache lines to pages

• Let each cache block touch only TLBSize pages (~31 on the Opteron); see the sketch after this list

• Unnecessary on Cell or Niagara
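A hedged sketch of the page-counting heuristic; PAGE_BYTES, TLB_PAGES, and the function name are assumptions for illustration, not the study's code.

#define PAGE_BYTES 4096L
#define TLB_PAGES  31          /* ~31 usable entries on the Opteron, per the slide */

/* cols[] lists the distinct source-vector columns a candidate cache block
 * touches, in ascending order.  Returns how many columns can be kept before
 * the block must be split so that x spans at most TLB_PAGES pages. */
static int tlb_block_limit(const int *cols, int ncols)
{
    int  pages = 0;
    long last_page = -1;
    for (int i = 0; i < ncols; i++) {
        long page = (long)cols[i] * (long)sizeof(double) / PAGE_BYTES;
        if (page != last_page) {        /* a new page of x is touched */
            if (++pages > TLB_PAGES)
                return i;               /* split the cache block here */
            last_page = page;
        }
    }
    return ncols;
}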

Page 26: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Register Blocking and Format Selection

• Heuristic approach
• Applied individually to each cache block
• Re-block the entire cache block into 8x8 tiles
• Examine all 8x8 tiles and compute how many smaller power-of-2 tiles they would break down into, e.g. how many 1x1, 2x1, 4x4, 2x8, etc.
• Combine with BCOO and BCSR to select the r x c format that minimizes the cache block size (see the sketch below)
• Cell only used 2x1 and larger BCOO
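A hedged sketch of the selection step, assuming the per-8x8-tile scan has already produced, for every power-of-2 tile shape, the count of non-empty r x c tiles; the byte costs and names are illustrative, not the exact sizes used in the study.

#include <limits.h>

typedef struct { int r, c, use_bcoo; long bytes; } fmt_choice;

/* tiles[i][j] = number of non-empty 2^i x 2^j tiles in this cache block. */
static fmt_choice pick_rxc_format(long tiles[4][4], long nrows)
{
    static const int dim[4] = {1, 2, 4, 8};
    fmt_choice best = {1, 1, 0, LONG_MAX};
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            long r = dim[i], c = dim[j], t = tiles[i][j];
            long values = t * r * c * 8;               /* doubles, incl. explicit zeros */
            long bcsr = values + t * 2                 /* 16b column index per tile     */
                               + (nrows / r + 1) * 4;  /* row pointers                  */
            long bcoo = values + t * 4;                /* 16b row + 16b column per tile */
            if (bcsr < best.bytes) best = (fmt_choice){(int)r, (int)c, 0, bcsr};
            if (bcoo < best.bytes) best = (fmt_choice){(int)r, (int)c, 1, bcoo};
        }
    }
    return best;
}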

Page 27: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Index Size Selection

• It is possible (always for Cell) that only 16b are required to store the column indices of a cache block

• The high bits are encoded in the coordinates of the block or the DMA list
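A minimal sketch of the idea, with illustrative names: indices are stored relative to the cache block's first column, and the block's base column carries the high bits.

#include <stdint.h>

/* Valid whenever the cache block spans at most 65536 columns of x. */
static inline uint16_t compress_col(int col, int block_col0)
{
    return (uint16_t)(col - block_col0);
}

/* The kernel recovers the full index as block_col0 + col16, or simply
 * indexes through a pointer already offset to &x[block_col0]. */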

Page 28: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Architecture Specific Kernels

• All optimizations to this point have been common (perhaps bounded by configurations) to all architectures.

• The kernels which process the resultant sub-matrices can be individually optimized for each architecture

Page 29: Tuning Sparse Matrix Vector Multiplication for multi-core processors

SIMDization

• Cell and x86 support explicit SIMD via intrinsics

• Cell showed a significant speedup

• Opteron was no faster

• Clovertown was no faster (if the correct compiler options were used)
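As an illustration of what the x86 SIMD looks like, here is a hedged SSE2 sketch for one 2x1 register block (two rows, one column); it is not the generated kernel itself, and the names are placeholders.

#include <emmintrin.h>

/* Accumulate one 2x1 tile: ysum holds the packed partial sums {y[r], y[r+1]}. */
static inline __m128d spmv_2x1_tile(__m128d ysum,
                                    const double *vals,  /* 2 values of the tile */
                                    const double *x, int col)
{
    __m128d a  = _mm_loadu_pd(vals);      /* {A[r][col], A[r+1][col]}        */
    __m128d xx = _mm_load1_pd(&x[col]);   /* broadcast x[col] into both lanes */
    return _mm_add_pd(ysum, _mm_mul_pd(a, xx));
}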

Page 30: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Loop Optimizations

• Few optimizations are possible, since the end of one row is the beginning of the next
• A few tweaks to loop variables

• Possible to software pipeline the kernel

• Possible to implement a branchless version (Cell BCOO worked)

Page 31: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Software Prefetching / DMA

• On Cell, all data is loaded via DMA
• It is double-buffered only for the nonzeros
• On the x86 machines, a hardware prefetcher is supposed to cover the unit-stride streaming accesses

• We found that explicit NTA prefetches deliver a performance boost (see the sketch after this list)

• Niagara (any version) cannot satisfy Little's Law with multithreading alone

• Prefetches would be useful if performance weren’t limited by other problems
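A hedged sketch of the x86 prefetching mentioned above, applied to the plain CSR loop; PF_DIST is an illustrative distance, not the tuned value.

#include <xmmintrin.h>

#define PF_DIST 64   /* nonzeros ahead; an assumption, not the tuned distance */

static void spmv_csr_pf(int nrows, const int *row_ptr, const int *cols,
                        const double *vals, const double *x, double *y)
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++) {
            /* Non-temporal prefetch of the nonzero stream ahead of its use. */
            _mm_prefetch((const char *)&vals[k + PF_DIST], _MM_HINT_NTA);
            _mm_prefetch((const char *)&cols[k + PF_DIST], _MM_HINT_NTA);
            sum += vals[k] * x[cols[k]];
        }
        y[r] = sum;
    }
}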

Page 32: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Code Generation

• Write a Perl script to generate all kernel variants
• For generic C, x86/Niagara used the same generator
• Separate generator for SSE
• Separate generator for Cell's SIMD
• Produce a configuration file for each architecture that limits the optimizations that can be made in the data structure, and their requisite kernels

Page 33: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Tuned Performance

Page 34: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Intel Clovertown

[Chart: tuned GFlop/s (0 to 3.5) across the matrix suite and the median; series: 1 Core Naïve, 1 Core [PF], 1 Core [PF,RB], 1 Core [PF,RB,CB], 2 Cores [*], 4 Cores [*], 2 Sockets x 4 Cores [*], OSKI, OSKI-PETSc.]

[Chart: % speedup from tuning (0% to 200%) across the matrix suite and the median.]

Benefit of tuning: +5% single thread, +80% with 8 threads

Page 35: Tuning Sparse Matrix Vector Multiplication for multi-core processors

AMD X2

[Chart: tuned GFlop/s (0 to 3.5) across the matrix suite and the median; series: Dual Socket x 2 Cores [*], 2 Cores [*], 1 Core [PF,RB,CB], 1 Core [PF,RB], 1 Core [PF], 1 Core Naïve, OSKI-PETSc, OSKI.]

[Chart: % speedup from tuning (0% to 400%) across the matrix suite and the median.]

Benefit of tuning: +60% single thread, +200% with 4 threads

Page 36: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Sun Niagara

[Chart: tuned GFlop/s (0 to 1.5) across the matrix suite and the median; series: 1 Core Naïve, 1 Core [PF], 1 Core [PF,RB], 1 Core [PF,RB,CB], 1 Core x 4 Threads [*], 2 Cores x 4 Threads [*], 4 Cores x 4 Threads [*], 8 Cores x 4 Threads [*].]

[Chart: % speedup from tuning (-20% to 30%) across the matrix suite and the median.]

Benefit of tuning: +5% single thread, +4% with 32 threads

Page 37: Tuning Sparse Matrix Vector Multiplication for multi-core processors

IBM Cell Blade

• Simpler & less efficient implementation
  – Only BCOO (branchless)
  – 2x1 and greater register blocking (no 1 x anything)

• Performance tracks the DMA flop:byte ratio

[Chart: IBM Cell Broadband Engine, tuned GFlop/s (0 to 12) across the matrix suite and the median; series: Dual Socket x 8 SPEs [RB,DMA,CB], 8 SPEs [RB,DMA,CB], 6 SPEs [RB,DMA,CB], 1 SPE [RB,DMA,CB].]

Page 38: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Relative Performance

• Cell is bandwidth bound
• Niagara is clearly not
• Noticeable Clovertown cache effect for Harbor, QCD, & Econ

[Chart: speedup over the slowest architecture (0 to 10) for each matrix (dense2, pdb1HYS, consph, cant, pwtk, rma10, qcd5-4, shipsec1, mac_econ_fwd500, mc2depi, cop20k_A, scircuit, webbase, rail4284) and the median; series: Cell, Opteron, Clovertown, Niagara.]

Page 39: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Comments

• Complex cores (superscalar, out-of-order, hardware stream prefetch, giant caches, …) saw the largest benefit from tuning
• First-generation Niagara saw relatively little benefit
• Single-thread performance was as good as or better than OSKI
• Parallel performance was significantly better than PETSc+OSKI
• The benchmark took 20 minutes; a comparable exhaustive search required 20 hours

Page 40: Tuning Sparse Matrix Vector Multiplication for multi-core processors

Questions?