Adaptive Multi-level Blocking Optimization for Sparse Matrix Vector Multiplication on GPU Yusuke Nagasaka, Akira Nukada, Satoshi Matsuoka Tokyo Institute of Technology



  • Sparse Matrix Computation
    • Large and sparse systems of equations, e.g. generated by FEM
      – Sparse matrix-vector multiplication (SpMV) accounts for much of the execution time of the solver
    • Sparse formats remove the needless zero elements
      – Indirect memory access to input vector elements
        » Random memory access: frequent cache misses
      – A row and a column index are required for each non-zero element
        » This aggravates the memory-boundness of mat-vec
    => Performance degradation of SpMV

    [Figure: CSR format example with Row pointer, Column id and Value arrays]
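The indirect access pattern of CSR can be made concrete with a minimal serial sketch (a plain C++ illustration, not the paper's GPU code; array contents below are arbitrary examples):

```cpp
#include <vector>
#include <cstddef>

// Serial SpMV in CSR format: y = A * x.
// row_ptr has nrows+1 entries; col_id/value hold one entry per non-zero.
// Note the indirect access x[col_id[j]]: for a sparse pattern this is
// effectively random access, the cache-miss source discussed above.
std::vector<double> csr_spmv(const std::vector<int>& row_ptr,
                             const std::vector<int>& col_id,
                             const std::vector<double>& value,
                             const std::vector<double>& x) {
    std::size_t nrows = row_ptr.size() - 1;
    std::vector<double> y(nrows, 0.0);
    for (std::size_t i = 0; i < nrows; ++i)
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            y[i] += value[j] * x[col_id[j]];
    return y;
}
```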

  • SpMV on Many-core Processors
    • Acceleration by high parallelism and memory bandwidth
      – Small caches vs. an enormous number of executing threads
        » Even more frequent cache misses
      – Performance is strongly limited by memory bandwidth
        » Many-core processors are not fully exploited

  • Sparse Format
    • Compressing needless zero elements
      – Storing only the non-zero elements
      – Reducing memory usage and computation
    • Each format has its own characteristics
      – Suited to particular architectures and given matrices

    Format            Memory layout   Load-balance   Access to matrix data   Locality of input vector access
    CSR               Row-major       -              ✓                       -
    ELLPACK           Column-major    -              -                       -
    SELL-C-σ          Column-major    ✓              -                       -
    Segmented, NUS    Column-major    ✓              -                       ✓
    BCCOO (yaSpMV)    Row-major       ✓              ✓                       -

  • Sparse Format
    • Reducing total memory access is important to alleviate the memory-boundness
      – However, existing work does not reduce it enough
      – We should consider the access to both matrix data and input vector elements
    • Our aim: reduction of total memory access

  • Contribution
    • We propose a new sparse matrix format for GPU
      – Adaptive Multi-level Blocking (AMB) format
      – Several optimization techniques such as division and blocking
      – Aggressively reduces total memory access in SpMV, including both matrix data and input vector, to improve performance
    • Precisely predicts total memory access and performance of SpMV
    • For 32 matrix datasets taken from the Florida sparse matrix collection, achieves speedups of
      – Up to x2.92 compared to cuSPARSE (x1.74 on average)
      – Up to x1.40 compared to yaSpMV (x1.13 on average)

  • Adaptive Multi-level Blocking (AMB)
    • Largely reduces total memory access in SpMV
    • Constructed by multi-level division and blocking
      – Improves the locality of the access to the input vector by dividing the matrix and the input vector
      – Mainly focuses on compression of the column index

    Format            Memory layout   Load-balance   Access to matrix data   Locality of input vector access
    CSR               Row-major       -              ✓                       -
    ELLPACK           Column-major    -              -                       -
    SELL-C-σ          Column-major    ✓              -                       -
    Segmented, NUS    Column-major    ✓              -                       ✓
    BCCOO (yaSpMV)    Row-major       ✓              ✓                       -
    AMB (proposal)    Column-major    ✓              ✓                       ✓

  • AMB (Adaptive Multi-level Blocking)
    • Level 1: Column-wise segmentation
      – Improves the locality of the access to input vector elements in SpMV
      – Segmentation width is less than 65536 (=2^16) columns
        » Enables the compression of the column index in the second-level division

    [Figure: matrix divided column-wise into segments of the segmentation width]
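The column-wise split can be sketched as follows (a hypothetical helper, not part of the AMB implementation; it only shows that after segmentation every local column index fits in 16 bits):

```cpp
#include <vector>
#include <utility>

// Level-1 sketch: assign each non-zero's column index to a segment of
// fixed width seg_width (< 65536, so local indices fit in 16 bits).
// Returns, per non-zero, the pair (segment id, local column index).
std::vector<std::pair<int, unsigned short>>
segment_columns(const std::vector<int>& col_id, int seg_width) {
    std::vector<std::pair<int, unsigned short>> out;
    out.reserve(col_id.size());
    for (int c : col_id)
        out.push_back({c / seg_width,
                       static_cast<unsigned short>(c % seg_width)});
    return out;
}
```

Each segment touches at most seg_width consecutive input-vector elements, which is what improves the locality of the vector accesses.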

  • AMB (Adaptive Multi-level Blocking)
    • Level 2: Row-wise segmentation and compression
      – Converts each sub-matrix to SELL-C-σ
      – Rows are sorted by the number of non-zero elements per row
        » The sorting scope (σ) is limited: less than 32768 (=2^15)
        » σ = 6 in this figure

    [Figure: rows 0-5 sorted by non-zeros per row within the sorting scope σ; a Permutation array records the original row indices]

  • AMB (Adaptive Multi-level Blocking)
    • Level 2: Row-wise segmentation and compression
      – Converts each sub-matrix to SELL-C-σ
      – Row-wise division into chunks of C rows (C = 3 in this figure)
        » C is set based on the architecture (C = 32 in the GPU case)

    [Figure: the sorted rows divided into chunks of C rows, with the Permutation array]

  • AMB (Adaptive Multi-level Blocking)
    • Level 2: Row-wise segmentation and compression
      – Converts each sub-matrix to SELL-C-σ
      – Each chunk is converted to ELLPACK
      – More memory access to CS, CL and Permutation is required compared to the original SELL-C-σ (without Level 1 segmentation)

    [Figure: Value data, Column Index, CS (Chunk Starting offset), CL (Chunk Length) and Permutation arrays]

  • AMB (Adaptive Multi-level Blocking)
    • Level 2: Row-wise segmentation and compression
      – Elimination of empty chunks
        » Removes needless elements of CS, CL and Permutation

    [Figure: the Value data, Column Index, CS, CL and Permutation arrays after removing the entries of empty chunks]

  • AMB (Adaptive Multi-level Blocking)
    • Level 2: Row-wise segmentation and compression
      – Column indices are represented by 16-bit integers (unsigned short)
        » The original index is divided by the segmentation width, and the remainder is stored as the column index
        » The quotient is stored in the upper 16 bits of the 'CL' (Chunk Length) array

    [Figure: Value data, 16-bit Column Index, CS, packed CL and Permutation arrays]
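The quotient/remainder packing described above might look like this (the exact bit layout inside 'CL' is an assumption based on the slide's description):

```cpp
#include <cstdint>

// Pack the column-segment quotient into the upper 16 bits of a 32-bit
// CL entry, with the chunk length in the lower 16 bits (layout assumed
// from the slide; the real AMB code may order the fields differently).
uint32_t pack_cl(uint16_t segment_quotient, uint16_t chunk_length) {
    return (static_cast<uint32_t>(segment_quotient) << 16) | chunk_length;
}

uint16_t cl_segment(uint32_t cl) { return static_cast<uint16_t>(cl >> 16); }
uint16_t cl_length(uint32_t cl)  { return static_cast<uint16_t>(cl & 0xFFFFu); }

// Recover a full column index from the packed quotient and the 16-bit
// remainder stored in the 'column index' array.
int full_column(uint32_t cl, uint16_t local_col, int seg_width) {
    return static_cast<int>(cl_segment(cl)) * seg_width + local_col;
}
```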

  • AMB (Adaptive Multi-level Blocking)
    • Level 2: Row-wise segmentation and compression
      – Compression of the original row index (permutation) by using 16-bit integers (σ ≤ 32768 (=2^15))
        » The original row index is divided by the sorting scope (σ = 8 in this figure), and the remainder is stored as the permutation
        » The quotient is stored in a Permutation offset array (16-bit integers)

    [Figure: Value data, Column Index, CS, CL, 16-bit Permutation and Permutation offset arrays]

  • AMB (Adaptive Multi-level Blocking)
    • Level 3: Blocking
      – Multiple non-zero elements are treated as one element
      – Compression of contiguous column indices (block size = 3 in the figure)
      – Large block size: the compression rate is high, but the amount of zero-filling in the value array may increase

    [Figure: with block size 3, each stored column index c stands for columns c, c+1 and c+2 of the value data]
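One plausible way to form the blocks of a row is a greedy cover of its sorted column indices (an illustrative sketch; the actual AMB conversion may align blocks differently):

```cpp
#include <vector>

// Level-3 sketch: cover one row's sorted column indices with blocks of
// b contiguous columns. Only one column index is stored per block; any
// column of a block that holds no non-zero is zero-filled in the value
// array. Returns the starting column of each block.
std::vector<int> block_columns(const std::vector<int>& cols, int b) {
    std::vector<int> starts;
    for (int c : cols) {
        if (starts.empty() || c >= starts.back() + b)
            starts.push_back(c);  // column not covered: open a new block
    }
    return starts;
}
```

For example, cols = {0, 1, 2, 5, 6} with b = 3 stores two block starts instead of five indices, at the cost of one zero-filled value (column 7 of the second block), which is exactly the compression/zero-fill tradeoff named above.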

  • AMB (Adaptive Multi-level Blocking)
    • Level 3: Blocking
      – Compression of contiguous column indices
      – Block size: 1 to 10 (block size = 3 in the figure)
      – The block size is chosen considering the tradeoff between compression of the column index and zero-filling in the value data

    [Figure: the Value data, Column Index, CS, CL, Permutation and Permutation offset arrays before and after blocking; the Column Index array shrinks by the block size]

  • SpMV Kernel for AMB Format in CUDA
    • Constructed as two CUDA kernels
      – Initialization of the output vector with zeros
      – Matrix-vector multiplication kernel
        » One thread is assigned to each row, and the result of each row is accumulated into the output vector by an atomic operation

    [Figure: the 1st Initialization Kernel zeroes the output vector; the 2nd Matrix-Vector Multiply Kernel scatters partial sums via the Permutation and Permutation offset arrays]
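A serial emulation of the two-phase scheme, under simplifying assumptions (one column segment, block size 1, no 16-bit packing); the final `+=` is where the real CUDA kernel uses atomicAdd:

```cpp
#include <vector>
#include <cstddef>

// Serial sketch of the two-kernel scheme: phase 1 zeroes the output;
// phase 2 gives one "thread" to each row of each chunk, which
// accumulates its dot-product contribution into the output slot named
// by Permutation. On the GPU this += is an atomicAdd, since rows of
// different column segments may target the same output element.
void amb_spmv(int C,                              // chunk height
              const std::vector<int>& cs,         // chunk start offsets
              const std::vector<int>& cl,         // chunk lengths
              const std::vector<int>& perm,       // output row per chunk row
              const std::vector<int>& col,        // column index per value
              const std::vector<double>& value,   // column-major per chunk
              const std::vector<double>& x,
              std::vector<double>& y) {
    for (double& v : y) v = 0.0;                  // phase 1: initialization
    for (std::size_t ch = 0; ch < cl.size(); ++ch)
        for (int t = 0; t < C; ++t) {             // one thread per row
            double sum = 0.0;
            int adr = cs[ch] + t;
            for (int i = 0; i < cl[ch]; ++i, adr += C)
                sum += value[adr] * x[col[adr]];
            y[perm[ch * C + t]] += sum;           // atomicAdd on the GPU
        }
}
```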

  • Estimation of Total Memory Access in SpMV
    • Performance prediction by estimating total memory access
      – The AMB format has high locality at the cache level
        » We can estimate the total memory access, including the random accesses
        » We can predict the performance without computation => utilized for parameter tuning

    Notation, for an M × N matrix:
      – v: bytes per floating-point value (v = 4 in single precision, v = 8 in double precision)
      – nnz_f: number of elements in 'value', including zero-filling
      – b: block size in the third level
      – C: chunk size in the second level
      – n_c: number of chunks

    Memory access to matrix data ('value', 'column index', 'CS', 'CL', 'Permutation offset', 'Permutation'):

      nnz_f * (v + 2/b) + n_c * 4 * 2 + n_c * 2 + n_c * C * 2                    (4.1)

    Memory access to the vectors (the output is both read and written by atomicAdd):

      N * v + M * v + n_c * C * v * 2                                            (4.2)

    Total memory access of the CUDA AMB kernel, the sum of (4.1) and (4.2):

      nnz_f * (v + 2/b) + n_c * (2 * v * C + 2 * C + 10) + (M + N) * v           (4.3)

    On AMD GPUs, where the floating-point atomicAdd is implemented with compare-and-swap, an additional check adds M + n_c * C accesses:

      nnz_f * (v + 2/b) + n_c * (2 * v * C + 3 * C + 10) + (M + N) * v + M       (4.4)
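Equation (4.3) transcribes directly into a small helper (parameter names follow the notation of this slide; this is a sketch for experimentation, not the paper's tuning code):

```cpp
// Estimated total memory access (bytes) of the CUDA AMB SpMV kernel,
// per equation (4.3): matrix-data access (4.1) plus vector access (4.2).
// M, N: row/column size; v: bytes per value; nnz_f: stored elements
// including zero-fill; b: block size; C: chunk size; n_c: chunk count.
double amb_total_memory_access(double M, double N, double v,
                               double nnz_f, double b,
                               double C, double n_c) {
    return nnz_f * (v + 2.0 / b)
         + n_c * (2.0 * v * C + 2.0 * C + 10.0)
         + (M + N) * v;
}
```

Evaluating this for each candidate (b, C, σ) is what makes parameter tuning possible without running any conversion or SpMV.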

  • Performance Evaluation

  • Experiment Environment
    • TSUBAME-KFC
      – GPU: NVIDIA Tesla K20X
        » Memory size: 6 GB
        » Memory bandwidth: 250 GB/s
        » ECC off
        » L2 cache size: 1.5 MB
        » Read-only cache size: 12 KB × 4 per SMX
      – CUDA: version 7.0

  • Experiment Environment
    • Matrix data
      – The University of Florida Sparse Matrix Collection
        » 32 positive definite symmetric matrices whose row sizes are larger than 131072 (=2^17)
    • Evaluated sparse formats
      – cuSPARSE: CSR, HYBRID, BSR (Block CSR)
      – SELL-C-σ
      – {Segmented, Non-Uniformly-Segmented}-SELL-C-σ
      – yaSpMV: BCCOO (single precision only)

  • Performance Evaluation: SpMV in Single Precision
    • Speedups of
      – x2.92 at maximum and x1.74 on average compared to cuSPARSE
      – x1.40 at maximum and x1.13 on average compared to yaSpMV
    • The performance of AMB becomes worse when the average number of non-zero elements per row (nz/row) is small
      – The benefit of compressing column indices is outweighed by the cost of atomicAdd

    [Figure: GFLOPS of each matrix (ordered small to large) for cuSPARSE, SELL-C-σ, (S,NUS)-SELL-C-σ, yaSpMV and AMB; nz/row = 24 is marked]

  • Performance Evaluation: SpMV in Double Precision
    • Speedups of
      – x1.86 at maximum and x1.29 on average compared to cuSPARSE
      – x2.59 at maximum and x1.83 on average compared to CSR
    • Memory access to the value data is larger and the compression rate becomes worse compared to single precision

    [Figure: GFLOPS for G3_circuit, offshore, Dubcova3, af_shell3, bmw7st_1, shipsec5 and audikw_1 (small to large) for cuSPARSE, SELL-C-σ, (S,NUS)-SELL-C-σ and AMB]

  • Performance Evaluation: SpMV in Single Precision
    • Performance of each level of the AMB format
      – Memory access to matrix data increases in Level 1, while the locality of the access to the input vector is improved
      – Reducing the memory access to matrix data in Levels 2 and 3 contributes the speedups

    [Figure: GFLOPS (small to large matrices) for SELL-C-σ, AMB (Level 1, 2 w/o compression), AMB (Level 1, 2) and AMB (full level)]

  • Total Memory Access in SpMV
    • Evaluating the total memory access in SpMV by using nvprof
      – The format with the minimum total memory access shows the best SpMV performance for most matrices

    [Figure: memory traffic ratio compared to CSR (lower is better) for G3_circuit through audikw_1, for SELL-C-σ, yaSpMV, AMB (Level 1, 2) and AMB]

  • Estimation of Total Memory Access and Performance Prediction
    • Very accurate estimation of the total memory access in SpMV with the AMB format

    [Figure: estimated vs. measured total memory access [MB]; correlation coefficient = 0.999]

  • Estimation of Total Memory Access and Performance Prediction
    • The performance of SpMV depends on the memory access
      – Performance degradation by load imbalance for some datasets

    [Figure: execution time of SpMV with the AMB format [msec] versus estimated total memory access [MB]]

  • Related Work (1)
    • CoAdELL [Maggioni, IPDPS 2014]
      – The delta between column indices is compressed to 16 or 8 bits
      – No compression when the delta is large
    • BRO format [Tang, SC '13]
      – The delta between column indices is represented in a more precise number of bits
      – Optimized for GPU by removing sequential execution
      – Although BRO successfully reduces the memory footprint, it requires additional decompression schemes

    [Figure: a column index array and its delta encoding; GFLOPS of BRO-ELL versus AMB for shipsec1, consph and pdb1HYS]

  • Related Work (2)
    • BCCOO (yaSpMV) [Yan, PPoPP '14]
      – Stores the matrix data in an extended COO format, blocking the matrix to reduce the memory usage of the column and row indices
        » Local memory such as shared memory on GPU is effectively utilized when the computation result of each block is accumulated in SpMV
      => Highly accelerates SpMV over existing libraries such as cuSPARSE
      – Parameter auto-tuning mechanism
        » Finds the best set of parameters by conversions and executions
        » Large parameter space: the cost of parameter tuning is extremely high

  • Conclusion
    • To improve the performance of SpMV on GPU, we propose the AMB (Adaptive Multi-level Blocking) format, which reduces the total memory access
      – Multi-level division and blocking improve the locality of the access to the input vector and compress the column index
      – High performance improvement over existing SpMV libraries such as cuSPARSE and yaSpMV
    • Future work
      – Investigating the effectiveness of our AMB format on other many-core processors such as AMD GPUs and Xeon Phi

  • Backup

  • Zero-filling in Value Data
    • The amount of zero-filling in the value data is small
      – The table shows the result in single precision

    Matrix       Best block size   #zero-fill / #non-zero [%]   Total memory access, best block size vs. block size = 1 [%]
    af_shell3    5                 0.096                        75.62
    audikw_1     3                 0.451                        79.77
    bmw7st_1     6                 4.709                        82.22
    Dubcova3     2                 21.691                       99.97
    G3_circuit   1                 0.070                        100.00
    offshore     1                 0.212                        100.00
    shipsec5     6                 1.731                        74.92

  • Loop Unrolling
    • The loop unrolling technique is used for SpMV
      – The divergence effect is large on GPU

    Block size = 1 (3 non-zeros per row, chunk height C = 4):

      sum = 0; adr = thread_id;
      for (i = 0; i < 3; i++) {
          sum += value[adr] * vec[col[adr]];  // one index load per element
          adr += 4;                           // stride = C
      }
      output += sum;

    Block size = 3 (one block of 3 contiguous columns per row):

      sum = 0; adr = thread_id;
      for (i = 0; i < 3 / 3; i++) {
          c = col[i * 4 + thread_id];         // one index load per block
          sum += value[adr]         * vec[c];
          sum += value[adr + 4 * 1] * vec[c + 1];
          sum += value[adr + 4 * 2] * vec[c + 2];
          adr += 4 * 3;                       // skip the whole block
      }
      output += sum;

    [Figure: value data and column index assignment to threads 0-3 for block sizes 1 and 3]

  • Total Memory Access and Execution Time
    • The execution time depends on the total memory access in SELL-C-σ, yaSpMV and AMB
      – AMB also achieves high throughput, although its total memory access is less than that of the other formats

    [Figure: execution time of SpMV [msec] versus measured memory traffic [MB] for CSR, SELL-C-σ, yaSpMV and AMB]