Efficient Sparse Matrix-Matrix Multiplication on Heterogeneous High Performance Systems AACEC 2010 – Heraklion, Crete, Greece Jakob Siegel 1 , Oreste Villa 2 , Sriram Krishnamoorthy 2 , Antonino Tumeo 2 and Xiaoming Li 1 1 University of Delaware 2 Pacific Northwest National Laboratory 1 September 24 th , 2010


Page 1:

Efficient Sparse Matrix-Matrix Multiplication on Heterogeneous High Performance Systems

AACEC 2010 – Heraklion, Crete, Greece

Jakob Siegel1, Oreste Villa2, Sriram Krishnamoorthy2, Antonino Tumeo2 and Xiaoming Li1

1 University of Delaware
2 Pacific Northwest National Laboratory

September 24th, 2010

Page 2:

Overview

Introduction
Cluster level
Node level
Results
Conclusion
Future Work

Page 3:

Overview

Introduction
Cluster level
Node level
Results
Conclusion
Future Work

Page 4:

Sparse Matrix-Matrix Multiply – Challenges

The efficient implementation of sparse matrix-matrix multiplications on HPC systems poses several challenges:

Large size of the input matrices, e.g. 10⁶ × 10⁶ with 30 × 10⁶ nonzero elements

Compressed representation
Partitioning
Density of the output matrices
Load balancing: large differences in density and computation times


Matrices taken from Timothy A. Davis. University of Florida Sparse Matrix Collection, available online at: http://www.cise.ufl.edu/davis/sparse.

Page 5:

Sparse Matrix-Matrix Multiply

Cross-cluster implementation:
Partitioning
Data distribution
Load balancing
Communication/scaling
Result handling

In-node implementation:
Multiple efficient SpGEMM algorithms
CPU/GPU implementation
Double buffering
Exploiting heterogeneity
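The double-buffering idea mentioned above can be sketched as follows. This is our illustration, not the authors' code: while the current task is computed, the data for the next task is prefetched into a second buffer, overlapping communication with computation. The function names and structure are assumptions.

```python
# Hedged sketch of double buffering: fetch(task) loads a task's data,
# compute(data) processes it; the fetch for task i+1 overlaps with the
# computation of task i.
from concurrent.futures import ThreadPoolExecutor

def run_tasks(tasks, fetch, compute):
    with ThreadPoolExecutor(max_workers=1) as pool:
        nxt = pool.submit(fetch, tasks[0])     # prefetch the first task
        results = []
        for i, t in enumerate(tasks):
            data = nxt.result()                # wait for the prefetch
            if i + 1 < len(tasks):
                nxt = pool.submit(fetch, tasks[i + 1])  # overlap next fetch
            results.append(compute(data))      # compute current task
        return results
```

With `fetch` standing in for a remote Global Arrays read and `compute` for a tile multiplication, `run_tasks([1, 2, 3], lambda t: t * 10, lambda d: d + 1)` returns `[11, 21, 31]`.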



Page 6:

Overview

Introduction
Cluster level
Node level
Results
Conclusion
Future Work

Page 7:

Sparse Matrix-Matrix Multiply – Cluster level

Blocking: the block size depends on the sparsity of the input matrices and on the number of processing elements, such that NumOfBlocksX × NumOfBlocksY >> NumOfProcessingElements.

Data layout: which format and order allow easy and fast access?

Communication and storage are implemented using Global Arrays (GA), which offers a set of primitives for non-blocking operations and for contiguous and non-contiguous data transfers.


Page 8:

Sparse Matrix-Matrix Multiply – Data representation and Tiling

[Figure: blocked matrices A and B and the result C = A×B]

• Blocked matrix representation: each block is stored in CSR* form.

 1 -1  0  0  0
 0  5  0  0  0
 0  0  4  6  0
-2  0  2  7  0
 0  0  0  0  5

 data (1, -1, 5, 4, 6, -2, 2, 7, 5)
 col  (0, 1, 1, 2, 3, 0, 2, 3, 4)
 row  (0, 2, 3, 5, 8, 9)

*CSR: Compressed Sparse Row
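The CSR encoding above can be reproduced with a short sketch (our illustration, not the authors' code): walk the matrix row by row, record each nonzero value and its column, and append a row pointer after every row.

```python
# Minimal sketch: build the CSR form (data, col, row) of a dense matrix.
def to_csr(m):
    data, col, row = [], [], [0]
    for r in m:
        for j, v in enumerate(r):
            if v != 0:
                data.append(v)   # nonzero value
                col.append(j)    # its column index
        row.append(len(data))    # row pointer: nonzeros seen so far
    return data, col, row

A = [[ 1, -1, 0, 0, 0],
     [ 0,  5, 0, 0, 0],
     [ 0,  0, 4, 6, 0],
     [-2,  0, 2, 7, 0],
     [ 0,  0, 0, 0, 5]]

data, col, row = to_csr(A)
# data = [1, -1, 5, 4, 6, -2, 2, 7, 5]
# col  = [0, 1, 1, 2, 3, 0, 2, 3, 4]
# row  = [0, 2, 3, 5, 8, 9]
```

This reproduces exactly the three arrays shown on the slide.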

Page 9:

Sparse Matrix-Matrix Multiply – Data representation and Tiling

[Figure: blocked matrices A and B and the result C = A×B]

Serialized layout: data, col, row (Tile 0); data, col, … (Tile 2); …

• Matrix A: the single CSR tiles are stored serialized into the GA space.
• Tile sizes and offsets are stored in a 2D array.
• Tiles with 0 nonzero elements are not represented in the GA dataset.
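The serialization scheme above can be sketched as follows. This is an illustration under assumptions of ours (the names and the flat-list buffer are not from the paper): CSR tiles are concatenated into one buffer, an index table records each tile's offset and size, and all-zero tiles get no entry at all.

```python
# Illustrative sketch: serialize CSR tiles into one flat buffer.
# tiles: dict {(bi, bj): (data, col, row) or None for an empty tile}
def serialize_tiles(tiles):
    buffer, index = [], {}
    for (bi, bj), tile in tiles.items():
        if tile is None or not tile[0]:
            continue                     # empty tiles are not represented
        data, col, row = tile
        size = len(data) * 2 + len(row)  # data + col + row pointers
        index[(bi, bj)] = (len(buffer), size)   # offset and size table
        buffer.extend(data); buffer.extend(col); buffer.extend(row)
    return buffer, index

buffer, index = serialize_tiles({(0, 0): ([1, 2], [0, 1], [0, 2]),
                                 (0, 1): None,
                                 (1, 0): ([3], [0], [0, 1])})
```

Here the empty tile (0, 1) leaves no trace in the buffer, and the index table alone locates every stored tile.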

Page 10:

Sparse Matrix-Matrix Multiply – Data representation and Tiling

• Matrix B: tiles are serialized in a transposed way.
• Depending on the algorithm used to calculate the single tiles, the data in the tiles can be stored transposed or not transposed.
• For the Gustavson algorithm the representation of the data in the tiles themselves is not transposed.

not transposed:

 1 -1  0  0  0
 0  5  0  0  0
 0  0  4  6  0
-2  0  2  7  0
 0  0  0  0  5

transposed:

 1  0  0 -2  0
-1  5  0  0  0
 0  0  4  2  0
 0  0  6  7  0
 0  0  0  0  5

Page 11:

Sparse Matrix-Matrix Multiply – Tasking and Data Movement

[Figure: blocks of C numbered as tasks: 0 1 2 3 4 / 5 6 7 8 …]

• Each block in C represents a task.

• Nodes grab tasks, and the additional data they need, when they have computational power available.

• Results are stored locally.

• The metadata of the result blocks in each node is distributed to determine the offsets of the tiles in the GA space.

• Tiles are put into the GA space in the right order.


Page 12:

Sparse Matrix-Matrix Multiply – Tasking and Data Movement

[Figure: row stripes 0…Sa−1 of A and column stripes 0…Sb−1 of B feeding the tiles of C = A×B]

• Each node fetches the data needed by the task it handles: e.g., for task/tile 5 the node has to load the data of stripes sa = 1 and sb = 0.
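With tasks numbered row-major over the tiles of C, the mapping from a task id to the stripes it needs can be sketched as below. This is our reading of the slides, not code from the paper; `num_blocks_x` (the number of tile columns of C) is an assumed name.

```python
# Hedged sketch: task t on a row-major tile grid needs row-stripe sa
# of A and column-stripe sb of B.
def stripes_for_task(t, num_blocks_x):
    sa = t // num_blocks_x   # tile-row of C -> stripe of A
    sb = t % num_blocks_x    # tile-column of C -> stripe of B
    return sa, sb
```

For the 5-tile-wide grid on the previous slide, `stripes_for_task(5, 5)` gives `(1, 0)`, matching the example of sa = 1 and sb = 0.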

Page 13:

Overview

Introduction
Cluster level
Node level
Results
Conclusion
Future Work

Page 14:

Sparse Matrix-Matrix Multiply – Gustavson

The algorithm is based on the equation:

c_i = Σ_{v : a_iv ≠ 0} a_iv · b_v,   for 0 ≤ i < p

i.e., the i-th row of C is a linear combination of those rows b_v of B for which a_iv is nonzero, where A has the dimensions p×q and B q×r.

A (6×6):                 B (6×5):
 2  3  0  0  0  0         1 -1  0  0  0
 0 -1  0  2  3  0         0  5  0  0  0
 0  0 -3  1  0  0         0  0  4  6  0
 0  0  2  3  0  0        -2  0  0  7 -4
 1  0  0  2  2  0         0  1  0  0  5
 0  0  0  2 -1  4         0  0  0  1  2

A in CSR:
 data (2, 3, -1, 2, 3, -3, 1, 2, 3, 1, 2, 2, 2, -1, 4)
 col  (0, 1, 1, 3, 4, 2, 3, 2, 3, 0, 3, 4, 3, 4, 5)
 row  (0, 2, 5, 7, 9, 12, 15)

B in CSR:
 data (1, -1, 5, 4, 6, -2, 7, -4, 1, 5, 1, 2)
 col  (0, 1, 1, 2, 3, 0, 3, 4, 1, 4, 3, 4)
 row  (0, 2, 3, 5, 8, 10, 12)

Accumulating row i = 1 of C = A × B:
 i=1 (init):  0  0  0  0  0
 i=1, v=1:    0 -5  0  0  0   (+ (−1) · b_1)
 i=1, v=3:   -4 -5  0 14 -8   (+ 2 · b_3)
 i=1, v=4:   -4 -2  0 14  7   (+ 3 · b_4)
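The row-by-row procedure above can be written out directly on the slide's CSR arrays. This is our illustration, not the authors' code: for row i, walk the nonzeros a_iv of A and add a_iv times row v of B into a dense accumulator.

```python
# Sketch of Gustavson's row-wise SpGEMM: compute row i of C = A*B
# from CSR triples a = (data, col, row) and b = (data, col, row);
# r is the number of columns of B.
def gustavson_row(i, a, b, r):
    a_data, a_col, a_row = a
    b_data, b_col, b_row = b
    c = [0] * r                              # dense accumulator for row i
    for k in range(a_row[i], a_row[i + 1]):  # nonzeros a_iv of row i of A
        v, a_iv = a_col[k], a_data[k]
        for j in range(b_row[v], b_row[v + 1]):  # nonzeros of row v of B
            c[b_col[j]] += a_iv * b_data[j]
    return c

A = ([2, 3, -1, 2, 3, -3, 1, 2, 3, 1, 2, 2, 2, -1, 4],
     [0, 1, 1, 3, 4, 2, 3, 2, 3, 0, 3, 4, 3, 4, 5],
     [0, 2, 5, 7, 9, 12, 15])
B = ([1, -1, 5, 4, 6, -2, 7, -4, 1, 5, 1, 2],
     [0, 1, 1, 2, 3, 0, 3, 4, 1, 4, 3, 4],
     [0, 2, 3, 5, 8, 10, 12])

# Row i = 1 reproduces the worked example: [-4, -2, 0, 14, 7]
```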

Page 15:

Sparse Matrix-Matrix Multiply – Gustavson

In the CUDA implementation (A and B as on the previous slide):

• Each result row c_i is handled by the 16 threads of a half warp (1/2W).
• For each nonzero element a_iv in A, one 1/2W performs the multiplications with row b_v in parallel.
• The results are kept in dense form until all calculations are complete.
• Then the results get compressed on the device.

Result C = A × B, accumulated into a zero-initialized dense buffer (half-warp 0, half-warp 1, half-warp 2, … each own one row):

  2 13   0   0   0
 -4 -2   0  14   7
 -2  0 -12 -11  -4
 -6  0   8  33 -12
 -3 -1   0  14   2
 -4 -1   0  18  -5
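The dense-then-compress strategy described above can be sketched sequentially in Python (our illustration; the real implementation runs the per-row accumulation on half-warps and the compression on the device):

```python
# CPU sketch mirroring the GPU strategy: every result row of C = A*B
# is accumulated in a dense buffer and only compressed to CSR once all
# multiplications for that row are complete.
def spgemm_dense_then_compress(a_dense, b_dense):
    p, q, r = len(a_dense), len(b_dense), len(b_dense[0])
    data, col, row = [], [], [0]
    for i in range(p):
        acc = [0] * r                       # zero-initialized dense row
        for v in range(q):
            if a_dense[i][v] != 0:          # only nonzeros a_iv of A
                for j in range(r):
                    acc[j] += a_dense[i][v] * b_dense[v][j]
        for j, x in enumerate(acc):         # compress row i to CSR
            if x != 0:
                data.append(x); col.append(j)
        row.append(len(data))
    return data, col, row
```

For example, `spgemm_dense_then_compress([[2, 0], [1, 3]], [[1, 0], [0, 4]])` returns `([2, 1, 12], [0, 0, 1], [0, 1, 3])`.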

Page 16:

Overview

Introduction
Cluster level
Node level
Results
Conclusion
Future Work

Page 17:

Sparse Matrix-Matrix Multiply – Case Study

Midsize matrix from the University of Florida Sparse Matrix Collection*

2D/3D problem
Size: 72,000 × 72,000, with 28,715,634 nonzeros
Blocked into 5041 tiles. The matrix is multiplied with itself.

*http://www.cise.ufl.edu/davis/sparse

Darker colors represent higher densities of nonzero elements.

Page 18:


Sparse Matrix-Matrix Multiply - Results

Scaling of SpGEMM with the different approaches

[Figure: execution time over number of nodes (1, 2, 4, 8, 16) for Static, LB-Hom and LB-Het; y-axis: time in sec, 0–350]

Page 19:

Sparse Matrix-Matrix Multiply - Results


[Figure: tasks executed by each node (node ids 0–15) for Static, LB-Hom and LB-Het; y-axis: number of tasks, 200–400]

[Figure: time to complete all assigned tasks per process, by node id (7 processes per node), for Static, LB-Hom and LB-Het; y-axis: time in sec, 0–30]

Page 20:

Sparse Matrix-Matrix Multiply - Results

Even inside a node, where different compute elements are used, the load-balancing mechanism still performs well.

The processes using the CUDA devices complete almost 5× more tasks than the pure CPU processes.


[Figure: tasks per core in one of the nodes, Static vs. LB-Het (CPU0–CPU6; with LB-Het, CUDA1 replaces one CPU process); y-axis: number of tasks, 0–120]

[Figure: time to complete all assigned tasks for each processor, Static vs. LB-Het; y-axis: time in sec, 0–25]

Page 21:

Overview

Introduction
Cluster level
Node level
Results
Conclusion
Future Work

Page 22:

Sparse Matrix-Matrix Multiply

We presented a parallel framework using a co-design approach which takes into account the characteristics of:

• the selected application (here SpGEMM)
• the underlying hardware (a heterogeneous cluster)

The difficulties of static partitioning approaches show that a global load-balancing method is needed.
Different optimized implementations of the Gustavson algorithm are presented and used depending on the available compute element.
For the selected case study, optimal load balancing with uniform computation time across all processing elements is achieved.


Page 23:

Overview

Introduction
Cluster level
Node level
Results
Conclusion
Future Work

Page 24:

Future Work – General Tasking Framework for Heterogeneous GPU Clusters

• More general task definition
• More flexibility in input and output data definition
• Exploring limits imposed on tasks by a heterogeneous system

• A feedback loop during execution that allows more efficient assignment of tasks
• Introducing heterogeneous execution on GPU and CPU in one process/core
• Locality-aware task queue(s) and work stealing
• Task reinsertion or generation at the node level


Page 25:

Thank you

Questions?
