Adaptive Multi-level Blocking Optimization for Sparse Matrix Vector Multiplication on GPU Yusuke Nagasaka, Akira Nukada, Satoshi Matsuoka Tokyo Institute of Technology



  • Sparse Matrix Computation
    • Large and sparse systems of equations, e.g. generated by FEM
      – Sparse matrix-vector multiplication (SpMV) accounts for much of the execution time of the solver
    • Sparse formats remove the needless zero elements
      – Indirect memory access to input vector elements
        » Random memory access: frequent cache misses
      – A row and a column index are required for each non-zero element
        » This aggravates the memory-boundness of mat-vec
    => Performance degradation of SpMV

    [Figure: CSR format example with Row pointer, Column id and Value arrays]
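The indirect access pattern of CSR can be made concrete with a minimal serial sketch (a plain C++ illustration, not the paper's GPU code; array contents below are arbitrary examples):

```cpp
#include <vector>
#include <cstddef>

// Serial SpMV in CSR format: y = A * x.
// row_ptr has nrows+1 entries; col_id/value hold one entry per non-zero.
// Note the indirect access x[col_id[j]]: for a sparse pattern this is
// effectively random access, the cache-miss source discussed above.
std::vector<double> csr_spmv(const std::vector<int>& row_ptr,
                             const std::vector<int>& col_id,
                             const std::vector<double>& value,
                             const std::vector<double>& x) {
    std::size_t nrows = row_ptr.size() - 1;
    std::vector<double> y(nrows, 0.0);
    for (std::size_t i = 0; i < nrows; ++i)
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            y[i] += value[j] * x[col_id[j]];
    return y;
}
```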

  • SpMV on Many-core Processors
    • Acceleration by high parallelism and memory bandwidth
      – Small caches vs. an enormous number of executing threads
        » Even more frequent cache misses
      – Performance is strongly limited by memory bandwidth
        » Many-core processors are not fully exploited

  • Sparse Format
    • Compressing needless zero elements
      – Storing only the non-zero elements
      – Reducing memory usage and computation
    • Each format has its own characteristics
      – Suited to particular architectures and given matrices

    Format            Memory layout   Load-balance   Access to matrix data   Locality of input vector access
    CSR               Row-major       -              ✓                       -
    ELLPACK           Column-major    -              -                       -
    SELL-C-σ          Column-major    ✓              -                       -
    Segmented, NUS    Column-major    ✓              -                       ✓
    BCCOO (yaSpMV)    Row-major       ✓              ✓                       -

  • Sparse Format
    • Reducing total memory access is important to alleviate the memory-boundness
      – However, existing work does not reduce it enough
      – We should consider the access to both matrix data and input vector elements
    • Our aim: reduction of total memory access

  • Contribution
    • We propose a new sparse matrix format for GPU
      – Adaptive Multi-level Blocking (AMB) format
      – Several optimization techniques such as division and blocking
      – Aggressively reduces total memory access in SpMV, including both matrix data and input vector, to improve performance
    • Precisely predicts total memory access and performance of SpMV
    • For 32 matrix datasets taken from the Florida sparse matrix collection, achieves speedups of
      – Up to x2.92 compared to cuSPARSE (x1.74 on average)
      – Up to x1.40 compared to yaSpMV (x1.13 on average)

  • Adaptive Multi-level Blocking (AMB)
    • Largely reduces total memory access in SpMV
    • Constructed by multi-level division and blocking
      – Improves the locality of the access to the input vector by dividing the matrix and the input vector
      – Mainly focuses on compression of the column index

    Format            Memory layout   Load-balance   Access to matrix data   Locality of input vector access
    CSR               Row-major       -              ✓                       -
    ELLPACK           Column-major    -              -                       -
    SELL-C-σ          Column-major    ✓              -                       -
    Segmented, NUS    Column-major    ✓              -                       ✓
    BCCOO (yaSpMV)    Row-major       ✓              ✓                       -
    AMB (proposal)    Column-major    ✓              ✓                       ✓

  • AMB (Adaptive Multi-level Blocking)
    • Level 1: Column-wise segmentation
      – Improves the locality of the access to input vector elements in SpMV
      – Segmentation width is less than 65536 (=2^16) columns
        » Enables the compression of the column index in the second-level division

    [Figure: matrix divided column-wise into segments of the segmentation width]
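The column-wise split can be sketched as follows (a hypothetical helper, not part of the AMB implementation; it only shows that after segmentation every local column index fits in 16 bits):

```cpp
#include <vector>
#include <utility>

// Level-1 sketch: assign each non-zero's column index to a segment of
// fixed width seg_width (< 65536, so local indices fit in 16 bits).
// Returns, per non-zero, the pair (segment id, local column index).
std::vector<std::pair<int, unsigned short>>
segment_columns(const std::vector<int>& col_id, int seg_width) {
    std::vector<std::pair<int, unsigned short>> out;
    out.reserve(col_id.size());
    for (int c : col_id)
        out.push_back({c / seg_width,
                       static_cast<unsigned short>(c % seg_width)});
    return out;
}
```

Each segment touches at most seg_width consecutive input-vector elements, which is what improves the locality of the vector accesses.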

  • AMB (Adaptive Multi-level Blocking)
    • Level 2: Row-wise segmentation and compression
      – Converts each sub-matrix to SELL-C-σ
      – Rows are sorted by the number of non-zero elements per row
        » The sorting scope (σ) is limited: less than 32768 (=2^15)
        » σ = 6 in this figure

    [Figure: rows 0-5 sorted by non-zeros per row within the sorting scope σ; a Permutation array records the original row indices]

  • AMB (Adaptive Multi-level Blocking)
    • Level 2: Row-wise segmentation and compression
      – Converts each sub-matrix to SELL-C-σ
      – Row-wise division into chunks of C rows (C = 3 in this figure)
        » C is set based on the architecture (C = 32 in the GPU case)

    [Figure: the sorted rows divided into chunks of C rows, with the Permutation array]

  • AMB (Adaptive Multi-level Blocking)
    • Level 2: Row-wise segmentation and compression
      – Converts each sub-matrix to SELL-C-σ
      – Each chunk is converted to ELLPACK
      – More memory access to CS, CL and Permutation is required compared to the original SELL-C-σ (without Level 1 segmentation)

    [Figure: Value data, Column Index, CS (Chunk Starting offset), CL (Chunk Length) and Permutation arrays]

  • AMB (Adaptive Multi-level Blocking)
    • Level 2: Row-wise segmentation and compression
      – Elimination of empty chunks
        » Removes needless elements of CS, CL and Permutation

    [Figure: the Value data, Column Index, CS, CL and Permutation arrays after removing the entries of empty chunks]

  • AMB (Adaptive Multi-level Blocking)
    • Level 2: Row-wise segmentation and compression
      – Column indices are represented by 16-bit integers (unsigned short)
        » The original index is divided by the segmentation width, and the remainder is stored as the column index
        » The quotient is stored in the upper 16 bits of the 'CL' (Chunk Length) array

    [Figure: Value data, 16-bit Column Index, CS, packed CL and Permutation arrays]
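The quotient/remainder packing described above might look like this (the exact bit layout inside 'CL' is an assumption based on the slide's description):

```cpp
#include <cstdint>

// Pack the column-segment quotient into the upper 16 bits of a 32-bit
// CL entry, with the chunk length in the lower 16 bits (layout assumed
// from the slide; the real AMB code may order the fields differently).
uint32_t pack_cl(uint16_t segment_quotient, uint16_t chunk_length) {
    return (static_cast<uint32_t>(segment_quotient) << 16) | chunk_length;
}

uint16_t cl_segment(uint32_t cl) { return static_cast<uint16_t>(cl >> 16); }
uint16_t cl_length(uint32_t cl)  { return static_cast<uint16_t>(cl & 0xFFFFu); }

// Recover a full column index from the packed quotient and the 16-bit
// remainder stored in the 'column index' array.
int full_column(uint32_t cl, uint16_t local_col, int seg_width) {
    return static_cast<int>(cl_segment(cl)) * seg_width + local_col;
}
```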

  • AMB (Adaptive Multi-level Blocking)
    • Level 2: Row-wise segmentation and compression
      – Compression of the original row index (permutation) by using 16-bit integers (σ ≤ 32768 (=2^15))
        » The original row index is divided by the sorting scope (σ = 8 in this figure), and the remainder is stored as the permutation
        » The quotient is stored in a Permutation offset array (16-bit integers)

    [Figure: Value data, Column Index, CS, CL, 16-bit Permutation and Permutation offset arrays]

  • AMB (Adaptive Multi-level Blocking)
    • Level 3: Blocking
      – Multiple non-zero elements are treated as one element
      – Compression of contiguous column indices (block size = 3 in the figure)
      – Large block size: the compression rate is high, but the amount of zero-filling in the value array may increase

    [Figure: with block size 3, each stored column index c stands for columns c, c+1 and c+2 of the value data]
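One plausible way to form the blocks of a row is a greedy cover of its sorted column indices (an illustrative sketch; the actual AMB conversion may align blocks differently):

```cpp
#include <vector>

// Level-3 sketch: cover one row's sorted column indices with blocks of
// b contiguous columns. Only one column index is stored per block; any
// column of a block that holds no non-zero is zero-filled in the value
// array. Returns the starting column of each block.
std::vector<int> block_columns(const std::vector<int>& cols, int b) {
    std::vector<int> starts;
    for (int c : cols) {
        if (starts.empty() || c >= starts.back() + b)
            starts.push_back(c);  // column not covered: open a new block
    }
    return starts;
}
```

For example, cols = {0, 1, 2, 5, 6} with b = 3 stores two block starts instead of five indices, at the cost of one zero-filled value (column 7 of the second block), which is exactly the compression/zero-fill tradeoff named above.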

  • AMB (Adaptive Multi-level Blocking)
    • Level 3: Blocking
      – Compression of contiguous column indices
      – Block size: 1 to 10 (block size = 3 in the figure)
      – The block size is chosen considering the tradeoff between compression of the column index and zero-filling in the value data

    [Figure: the Value data, Column Index, CS, CL, Permutation and Permutation offset arrays before and after blocking; the Column Index array shrinks by the block size]

  • SpMV Kernel for AMB Format in CUDA
    • Constructed as two CUDA kernels
      – Initialization of the output vector with zeros
      – Matrix-vector multiplication kernel
        » One thread is assigned to each row, and the result of each row is accumulated into the output vector by an atomic operation

    [Figure: the 1st Initialization Kernel zeroes the output vector; the 2nd Matrix-Vector Multiply Kernel scatters partial sums via the Permutation and Permutation offset arrays]
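A serial emulation of the two-phase scheme, under simplifying assumptions (one column segment, block size 1, no 16-bit packing); the final `+=` is where the real CUDA kernel uses atomicAdd:

```cpp
#include <vector>
#include <cstddef>

// Serial sketch of the two-kernel scheme: phase 1 zeroes the output;
// phase 2 gives one "thread" to each row of each chunk, which
// accumulates its dot-product contribution into the output slot named
// by Permutation. On the GPU this += is an atomicAdd, since rows of
// different column segments may target the same output element.
void amb_spmv(int C,                              // chunk height
              const std::vector<int>& cs,         // chunk start offsets
              const std::vector<int>& cl,         // chunk lengths
              const std::vector<int>& perm,       // output row per chunk row
              const std::vector<int>& col,        // column index per value
              const std::vector<double>& value,   // column-major per chunk
              const std::vector<double>& x,
              std::vector<double>& y) {
    for (double& v : y) v = 0.0;                  // phase 1: initialization
    for (std::size_t ch = 0; ch < cl.size(); ++ch)
        for (int t = 0; t < C; ++t) {             // one thread per row
            double sum = 0.0;
            int adr = cs[ch] + t;
            for (int i = 0; i < cl[ch]; ++i, adr += C)
                sum += value[adr] * x[col[adr]];
            y[perm[ch * C + t]] += sum;           // atomicAdd on the GPU
        }
}
```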

  • Estimation of Total Memory Access in SpMV
    • Performance prediction by estimating total memory access
      – The AMB format has high locality at the cache level
        » We can estimate the total memory access, including the random accesses
        » We can predict the performance without computation => utilized for parameter tuning

    Notation, for an M × N matrix:
      – v: bytes per floating-point value (v = 4 in single precision, v = 8 in double precision)
      – nnz_f: number of elements in 'value', including zero-filling
      – b: block size in the third level
      – C: chunk size in the second level
      – n_c: number of chunks

    Memory access to matrix data ('value', 'column index', 'CS', 'CL', 'Permutation offset', 'Permutation'):

      nnz_f * (v + 2/b) + n_c * 4 * 2 + n_c * 2 + n_c * C * 2                    (4.1)

    Memory access to the vectors (the output is both read and written by atomicAdd):

      N * v + M * v + n_c * C * v * 2                                            (4.2)

    Total memory access of the CUDA AMB kernel, the sum of (4.1) and (4.2):

      nnz_f * (v + 2/b) + n_c * (2 * v * C + 2 * C + 10) + (M + N) * v           (4.3)

    On AMD GPUs, where the floating-point atomicAdd is implemented with compare-and-swap, an additional check adds M + n_c * C accesses:

      nnz_f * (v + 2/b) + n_c * (2 * v * C + 3 * C + 10) + (M + N) * v + M       (4.4)
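Equation (4.3) transcribes directly into a small helper (parameter names follow the notation of this slide; this is a sketch for experimentation, not the paper's tuning code):

```cpp
// Estimated total memory access (bytes) of the CUDA AMB SpMV kernel,
// per equation (4.3): matrix-data access (4.1) plus vector access (4.2).
// M, N: row/column size; v: bytes per value; nnz_f: stored elements
// including zero-fill; b: block size; C: chunk size; n_c: chunk count.
double amb_total_memory_access(double M, double N, double v,
                               double nnz_f, double b,
                               double C, double n_c) {
    return nnz_f * (v + 2.0 / b)
         + n_c * (2.0 * v * C + 2.0 * C + 10.0)
         + (M + N) * v;
}
```

Evaluating this for each candidate (b, C, σ) is what makes parameter tuning possible without running any conversion or SpMV.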

  • Performance Evaluation

  • Experiment Environment
    • TSUBAME-KFC
      – GPU: NVIDIA Tesla K20X
        » Memory size: 6 GB
        » Memory bandwidth: 250 GB/s
        » ECC off
        » L2 cache size: 1.5 MB
        » Read-only cache size: 12 KB × 4 per SMX
      – CUDA: version 7.0

  • Experiment Environment
    • Matrix data
      – The University of Florida Sparse Matrix Collection
        » 32 positive definite symmetric matrices whose row sizes are larger than 131072 (=2^17)
    • Evaluated sparse formats
      – cuSPARSE: CSR, HYBRID, BSR (Block CSR)
      – SELL-C-σ
      – {Segmented, Non-Uniformly-Segmented}-SELL-C-σ
      – yaSpMV: BCCOO (single precision only)

  • Performance Evaluation: SpMV in Single Precision
    • Speedups of
      – x2.92 at maximum and x1.74 on average compared to cuSPARSE
      – x1.40 at maximum and x1.13 on average compared to yaSpMV
    • The performance of AMB becomes worse when the average number of non-zero elements per row (nz/row) is small
      – The benefit of compressing column indices is outweighed by the cost of atomicAdd

    [Figure: GFLOPS of each matrix (ordered small to large) for cuSPARSE, SELL-C-σ, (S,NUS)-SELL-C-σ, yaSpMV and AMB; nz/row = 24 is marked]

  • Performance Evaluation: SpMV in Double Precision
    • Speedups of
      – x1.86 at maximum and x1.29 on average compared to cuSPARSE
      – x2.59 at maximum and x1.83 on average compared to CSR
    • Memory access to the value data is larger and the compression rate becomes worse compared to single precision

    [Figure: GFLOPS for G3_circuit, offshore, Dubcova3, af_shell3, bmw7st_1, shipsec5 and audikw_1 (small to large) for cuSPARSE, SELL-C-σ, (S,NUS)-SELL-C-σ and AMB]

  • Performance Evaluation: SpMV in Single Precision
    • Performance of each level of the AMB format
      – Memory access to matrix data increases in Level 1, while the locality of the access to the input vector is improved
      – Reducing the memory access to matrix data in Levels 2 and 3 contributes the speedups

    [Figure: GFLOPS (small to large matrices) for SELL-C-σ, AMB (Level 1, 2 w/o compression), AMB (Level 1, 2) and AMB (full level)]

  • Total Memory Access in SpMV
    • Evaluating the total memory access in SpMV by using nvprof
      – The format with the minimum total memory access shows the best SpMV performance for most matrices

    [Figure: memory traffic ratio compared to CSR (lower is better) for G3_circuit through audikw_1, for SELL-C-σ, yaSpMV, AMB (Level 1, 2) and AMB]

  • Estimation of Total Memory Access and Performance Prediction
    • Very accurate estimation of the total memory access in SpMV with the AMB format

    [Figure: estimated vs. measured total memory access [MB]; correlation coefficient = 0.999]

  • Estimation of Total Memory Access and Performance Prediction
    • The performance of SpMV depends on the memory access
      – Performance degradation by load imbalance for some datasets

    [Figure: execution time of SpMV with the AMB format [msec] versus estimated total memory access [MB]]

  • Related Work (1)
    • CoAdELL [Maggioni, IPDPS 2014]
      – The delta between column indices is compressed to 16 or 8 bits
      – No compression when the delta is large
    • BRO format [Tang, SC '13]
      – The delta between column indices is represented in a more precise number of bits
      – Optimized for GPU by removing sequential execution
      – Although BRO successfully reduces the memory footprint, it requires additional decompression schemes

    [Figure: a column index array and its delta encoding; GFLOPS of BRO-ELL versus AMB for shipsec1, consph and pdb1HYS]

  • Related Work (2)
    • BCCOO (yaSpMV) [Yan, PPoPP '14]
      – Stores the matrix data in an extended COO format, blocking the matrix to reduce the memory usage of the column and row indices
        » Local memory such as shared memory on GPU is effectively utilized when the computation result of each block is accumulated in SpMV
      => Highly accelerates SpMV over existing libraries such as cuSPARSE
      – Parameter auto-tuning mechanism
        » Finds the best set of parameters by conversions and executions
        » Large parameter space: the cost of parameter tuning is extremely high

  • Conclusion
    • To improve the performance of SpMV on GPU, we propose the AMB (Adaptive Multi-level Blocking) format, which reduces the total memory access
      – Multi-level division and blocking improve the locality of the access to the input vector and compress the column index
      – High performance improvement over existing SpMV libraries such as cuSPARSE and yaSpMV
    • Future work
      – Investigating the effectiveness of our AMB format on other many-core processors such as AMD GPUs and Xeon Phi

  • Backup

  • Zero-filling in Value Data
    • The amount of zero-filling in the value data is small
      – The table shows the result in single precision

    Matrix       Best block size   #zero-fill / #non-zero [%]   Total memory access, best block size vs. block size = 1 [%]
    af_shell3    5                 0.096                        75.62
    audikw_1     3                 0.451                        79.77
    bmw7st_1     6                 4.709                        82.22
    Dubcova3     2                 21.691                       99.97
    G3_circuit   1                 0.070                        100.00
    offshore     1                 0.212                        100.00
    shipsec5     6                 1.731                        74.92

  • Loop Unrolling
    • The loop unrolling technique is used for SpMV
      – The divergence effect is large on GPU

    Block size = 1 (3 non-zeros per row, chunk height C = 4):

      sum = 0; adr = thread_id;
      for (i = 0; i < 3; i++) {
          sum += value[adr] * vec[col[adr]];  // one index load per element
          adr += 4;                           // stride = C
      }
      output += sum;

    Block size = 3 (one block of 3 contiguous columns per row):

      sum = 0; adr = thread_id;
      for (i = 0; i < 3 / 3; i++) {
          c = col[i * 4 + thread_id];         // one index load per block
          sum += value[adr]         * vec[c];
          sum += value[adr + 4 * 1] * vec[c + 1];
          sum += value[adr + 4 * 2] * vec[c + 2];
          adr += 4 * 3;                       // skip the whole block
      }
      output += sum;

    [Figure: value data and column index assignment to threads 0-3 for block sizes 1 and 3]

  • Total Memory Access and Execution Time
    • The execution time depends on the total memory access in SELL-C-σ, yaSpMV and AMB
      – AMB also achieves high throughput, although its total memory access is less than that of the other formats

    [Figure: execution time of SpMV [msec] versus measured memory traffic [MB] for CSR, SELL-C-σ, yaSpMV and AMB]