Parallel Cholesky Decomposition in Julia (6.338 Project)

Omar Mysore

December 16, 2011

Project Introduction
Sparse matrices dominate numerical simulation and technical computing. Their applications are practically limitless, ranging from solving partial differential equations to convex optimization to big data. Because of the wide use of sparse matrices, the objective of this project was to implement sparse Cholesky decomposition in parallel using the Julia language. In addition to developing the Cholesky decomposition feature in the Julia language and investigating the effects of parallelization, this project served the purpose of aiding the development of sparse matrix support in Julia.

A working sparse parallel Cholesky decomposition solver was developed. Although the current implementation has limitations in terms of speed and capability, it is hoped that it can serve as a basis for further development. The remainder of this report will discuss the basics of Cholesky decomposition and the algorithm used, the opportunities and methods of parallelization, the results, and the potential for future work.
Cholesky Decomposition

The Cholesky decomposition of a symmetric positive definite matrix A determines the lower-triangular matrix L, where LL' = A. Although it is limited to symmetric positive definite matrices, these matrices often appear in fields such as convex optimization. The following equations are taken from [2].

The basic dense Cholesky decomposition algorithm consists of repeatedly performing the factorization below on a matrix A of size n, where d is a scalar and v is an (n-1)-by-1 column vector:

    A = [ d   v' ]  =  [ sqrt(d)    0 ] [ 1   0          ] [ sqrt(d)   v'/sqrt(d) ]
        [ v   C  ]     [ v/sqrt(d)  I ] [ 0   C - vv'/d  ] [ 0         I          ]

Once this factorization is completed, the first column of L is determined. Then the same factorization is performed on C - vv'/d, and the process is repeated until all of the columns of L are found. The same process can also be carried out blockwise, with d replaced by a leading diagonal block and v by a block of columns.
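For illustration, the repeated factorization described above can be sketched as follows. This is a sketch in Python/NumPy rather than the report's own code (the 2011-era Julia dialect used here no longer runs on modern Julia), and `cholesky_outer` is an illustrative name, not part of the project:

```python
import numpy as np

def cholesky_outer(A):
    """Dense Cholesky via the outer-product recursion: returns lower L with L L' = A."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        d = A[j, j]                      # the scalar d
        v = A[j + 1:, j]                 # the (n-1)-by-1 column v
        L[j, j] = np.sqrt(d)
        L[j + 1:, j] = v / np.sqrt(d)    # first column of L at this step
        # recurse on the trailing submatrix C - v v' / d
        A[j + 1:, j + 1:] -= np.outer(v, v) / d
    return L

A = np.array([[4.0, 2.0], [2.0, 10.0]])
L = cholesky_outer(A)   # L @ L.T reproduces A
```

Each pass of the loop performs one instance of the factorization above: it emits one column of L and replaces the trailing block by C - vv'/d.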
Cholesky Decomposition for Sparse Matrices

For sparse matrices, several additional steps are taken in order to take advantage of the substantially fewer nonzero values. First, the fill-ins and the structure of L are determined without calculating the values. Next, the tree of dependencies is found, and finally the values of L are calculated. For a detailed explanation of the method summarized in this report, see [2].
The first step in sparse Cholesky decomposition is to determine the structure of L without explicitly computing L. All of the following images are taken from John Gilbert's slides [1]. Suppose A has the structure below:

Then the goal is to determine the structure of L, which is shown below:

The red dots are known as the fill-ins, since these values are nonzero in L but zero in A. Although the structure of L is now known, none of the specific values are. In order to determine the structure of L, the graphs of the matrices are used. Below is the graph of the previously shown matrix A:
From this graph, the graph of L can easily be determined by connecting the higher-numbered neighbors of each node/column in the graph. Below is the graph of L with the fill-ins in red:
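The symbolic step just described can be sketched in Python (illustrative names, not the project's code): eliminating each node in order connects its higher-numbered neighbors, and every edge created this way is a fill-in of L.

```python
def symbolic_fill(edges, n):
    """Given the undirected graph of a symmetric matrix A on nodes 0..n-1,
    return the edge set of the graph of L (original edges plus fill-ins)."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    # eliminate nodes in order; node v links together its higher-numbered
    # neighbors, and fill edges themselves create further fill later on
    for v in range(n):
        higher = sorted(u for u in adj[v] if u > v)
        for a in range(len(higher)):
            for b in range(a + 1, len(higher)):
                i, j = higher[a], higher[b]
                if j not in adj[i]:          # a new edge: a fill-in
                    adj[i].add(j)
                    adj[j].add(i)
    return {(min(i, j), max(i, j)) for i in adj for j in adj[i]}
```

For example, on the graph with edges (0,2), (0,3), (1,3), eliminating node 0 connects its higher neighbors 2 and 3, producing the single fill edge (2,3).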
Once the structure of L is determined, the dependency tree can be computed. To build the dependency tree, the parent of each node must be found: the parent of a given node/column j is the smallest row index i > j for which L[i,j] is nonzero, i.e., the first nonzero below the diagonal in column j of L. For example, for the graph of L previously shown, the dependency tree is shown below:
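The parent rule above can be sketched in Python (an illustrative sketch; `dependency_tree` is not part of the project's code):

```python
def dependency_tree(L_structure):
    """L_structure[i][j] is True where L is nonzero. Returns parent[j],
    or None for columns with no nonzero below the diagonal (e.g. the root)."""
    n = len(L_structure)
    parent = [None] * n
    for j in range(n):
        below = [i for i in range(j + 1, n) if L_structure[i][j]]
        if below:
            parent[j] = min(below)   # first nonzero row below the diagonal
    return parent
```

For a 3-by-3 L with nonzeros at (0,0), (1,1), (2,0), (2,1), (2,2), both columns 0 and 1 have their first subdiagonal nonzero in row 2, so node 2 is the parent of both.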
After the dependency tree is determined, the values of L can be calculated. The basic procedure for determining each column of L is as follows:

    For each column j in A:
        Determine the frontal matrix Fj, assembled from the entries of A in the rows of
        the nonzero structure of column j, together with the update contribution -Lc*Lc'
        from each child column c.
        The column is then Lj = vcat(sqrt(Fj[1,1]), Fj[2:end,1]/sqrt(Fj[1,1])).

Here, T[j] - {j} represents all of the children nodes, which must already have been computed in order to calculate Lj. This is the basic formulation for determining the sparse Cholesky decomposition. All equations and images used in this and the previous section were obtained from [1] and [2]. For more details, see [2].
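The per-column formula can be sketched as follows. This is a dense, left-looking Python variant rather than the report's frontal-matrix code, but each column j is likewise assembled from the entries of A minus the -Lk*Lk' contributions of already-computed columns, then scaled by the square root of its pivot (`chol_columns` is an illustrative name):

```python
import numpy as np

def chol_columns(A):
    """Column-by-column Cholesky: assemble each column from A and the
    contributions of earlier columns, then scale by sqrt of the pivot."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        col = A[j:, j].copy()
        for k in range(j):
            if L[j, k] != 0.0:               # only columns that touch row j contribute
                col -= L[j, k] * L[j:, k]    # subtract the update from column k
        L[j, j] = np.sqrt(col[0])            # pivot
        L[j + 1:, j] = col[1:] / L[j, j]     # scale the rest of the column
    return L
```

In the sparse setting only the rows in the nonzero structure of column j need to be assembled, which is what the frontal matrix Fj captures.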
Parallelization and Implementation in Julia

For sparse Cholesky decomposition, there are three primary opportunities for parallelization: first, in determining the structure of L; second, in simultaneously computing columns that are independent of one another; and third, in turning the addition step for each column into a parallel reduction.
In order to determine the structure of L, each column or block of columns can be sent to a different processor, and the necessary fill-ins can be determined. Once the fill-ins are determined, they can be added to L. The following Julia function demonstrates this:
    function fillinz(L)
        L = tril(L)
        L = spones(L)
        refs = [remote_call((mod(i, nprocs())) + 1, pfillz, L[:,i], i) | i = 1:size(L)[1]]
        for i = 1:size(L)[1]
            q = fetch(refs[i])
            for j = 1:length(q)/2
                L[q[2*j-1], q[2*j]] = 1
            end
        end
        return L
    end
The input to this function is the matrix A for which we would like the Cholesky decomposition. The comprehension refs sends each column to a different processor, and the for-loop collects the results and adds the fill-ins.
The tree structure of dependencies allows for further parallelization. As previously discussed, for a matrix A the dependency tree might look like the following figure:

In this case, columns 1, 2, 4, and 6 of L can all be determined without any other columns of L, so they can be computed simultaneously. The following for-loop, which is part of the main sparse Cholesky function, performs this process of going through the levels of the tree and sending all of the columns at the same level to different processors:
    for i = 1:size(kids)[1]
        k = []
        for j = 1:size(kids)[1]
            if tree[j] == i-1
                k = vcat(k, [j])
            end
        end
        refs = [remote_call(mod(i, nprocs()) + 1, spcholhelp, A, L, k[i], kids) | i = 1:length(k)]
        for m = 1:length(refs)
            lcol = fetch(refs[m])
            L[:, k[m]] = lcol
        end
    end
In this for-loop, the index i loops through all of the possible values for the number of children a column can have. The vector refs holds all of the columns of L that are calculated in parallel for a given level of the tree.
The final level of parallelization is within the function that determines the values of the columns of L. During the calculation, a number of matrices equal to the number of children of the given column must be added together. Rather than adding them serially, this is done with a parallel for-loop, shown below:
    addz = @parallel (+) for i = 1:size(kids)[1]
        -L[nzs, convert(Int16, kids[i])] * L[nzs, convert(Int16, kids[i])]'
    end
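The same parallel reduction can be sketched in Python (illustrative only; the report uses Julia's @parallel (+) macro): each child's -Lc*Lc' contribution is computed on a worker and the partial results are summed.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_update(L, nzs, kid_cols):
    """Sum of -L[nzs, c] * L[nzs, c]' over the child columns c, computed
    as a parallel map followed by a reduction with +."""
    def contribution(c):
        v = L[np.ix_(nzs, [c])]          # the child column restricted to rows nzs
        return -v @ v.T                  # one child's update matrix
    with ThreadPoolExecutor() as pool:
        parts = pool.map(contribution, kid_cols)
    return sum(parts)                    # the reduction step
```

Because matrix addition is associative, the partial updates can be combined in any order, which is what makes this step safe to parallelize.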
Results and Discussion

The primary objective of the project was to write a function that performs Cholesky decomposition on a sparse matrix. This function is called spchol(), and its input argument is the matrix. The function appears to work for all matrices tested. A very simple example is shown:
    julia> A
    10x10 Float64 Array:
     121.0   33.0  0.0  0.0   33.0    0.0   55.0   0.0   44.0   0.0
      33.0   10.0  0.0  0.0   12.0    0.0   17.0   2.0   12.0   0.0
       0.0    0.0  1.0  0.0    0.0    9.0    0.0   0.0    0.0   0.0
       0.0    0.0  0.0  1.0    8.0    0.0    0.0   0.0    0.0   0.0
      33.0   12.0  0.0  8.0   83.0    0.0   23.0   6.0   12.0   2.0
       0.0    0.0  9.0  0.0    0.0  162.0   18.0  63.0    0.0   0.0
      55.0   17.0  0.0  0.0   23.0   18.0   38.0  18.0   20.0   4.0
       0.0    2.0  0.0  0.0    6.0   63.0   18.0  54.0    0.0   3.0
      44.0   12.0  0.0  0.0   12.0    0.0   20.0   0.0   17.0   0.0
       0.0    0.0  0.0  0.0    2.0    0.0    4.0   3.0    0.0  14.0

    julia> u = spchol(A)
    10x10 Float64 Array:
     11.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
      3.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
      0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
      0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
      3.0  3.0  0.0  8.0  1.0  0.0  0.0  0.0  0.0  0.0
      0.0  0.0  9.0  0.0  0.0  9.0  0.0  0.0  0.0  0.0
      5.0  2.0  0.0  0.0  2.0  2.0  1.0  0.0  0.0  0.0
      0.0  2.0  0.0  0.0  0.0  7.0  0.0  1.0  0.0  0.0
      4.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
      0.0  0.0  0.0  0.0  2.0  0.0  0.0  3.0  0.0  1.0

    julia> u*u'
    10x10 Float64 Array:
     121.0   33.0  0.0  0.0   33.0    0.0   55.0   0.0   44.0   0.0
      33.0   10.0  0.0  0.0   12.0    0.0   17.0   2.0   12.0   0.0
       0.0    0.0  1.0  0.0    0.0    9.0    0.0   0.0    0.0   0.0
       0.0    0.0  0.0  1.0    8.0    0.0    0.0   0.0    0.0   0.0
      33.0   12.0  0.0  8.0   83.0    0.0   23.0   6.0   12.0   2.0
       0.0    0.0  9.0  0.0    0.0  162.0   18.0  63.0    0.0   0.0
      55.0   17.0  0.0  0.0   23.0   18.0   38.0  18.0   20.0   4.0
       0.0    2.0  0.0  0.0    6.0   63.0   18.0  54.0    0.0   3.0
      44.0   12.0  0.0  0.0   12.0    0.0   20.0   0.0   17.0   0.0
       0.0    0.0  0.0  0.0    2.0    0.0    4.0   3.0    0.0  14.0

Above, A and u*u' are identical.
Tests were conducted to see the effects of parallelization. All matrices were generated in the same manner: first declaring A = tril(round(10*sprand(n,n,.3)) + eye(n)); and then A = A*A'. The results are shown in the table below.
    Size      Time on 1 processor   Time on 4 processors   Parallel-to-serial ratio
    10x10     0.0025 s              0.189 s                76
    50x50     0.101 s               0.52 s                 5
    100x100   2.8 s                 2.8 s                  1
    150x150   24.5 s                22.1 s                 0.9
For smaller matrices, serial appears to be better, because the communication between processors dominates. As the size of the matrices increases, the parallel version improves substantially relative to serial. More tests need to be conducted in order to understand the behavior of this function.
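The test-matrix construction described above can be sketched in Python/NumPy (an illustrative analogue of the Julia expression, not the code used for the timings): B*B' with a nonsingular lower-triangular B is always symmetric positive definite, so spchol() receives a valid input.

```python
import numpy as np

def make_test_matrix(n, density=0.3, seed=0):
    """Analogue of A = tril(round(10*sprand(n,n,.3)) + eye(n)); A = A*A'."""
    rng = np.random.default_rng(seed)
    mask = rng.random((n, n)) < density          # sparsity pattern, ~30% nonzero
    B = np.round(10 * rng.random((n, n)) * mask)
    B = np.tril(B) + np.eye(n)                   # lower triangular, diagonal >= 1
    return B @ B.T                               # symmetric positive definite

A = make_test_matrix(10)
```

Adding the identity before forming B*B' guarantees a positive diagonal on the triangular factor, so the product is positive definite regardless of the random pattern.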
Limitations and Further Work

As stated previously, more tests need to be conducted in order to understand the performance and limitations of the spchol() function. Additionally, a major limitation is the fact that although the algorithm is designed for sparse matrices, it currently only works on sparse matrices stored in full (dense) form (and, of course, on full matrices). This is also a possible cause of the large growth in runtime with matrix dimension shown in the results. The algorithm was initially developed this way because of errors involving indexing columns of matrices that were declared as sparse. While these errors were recently fixed, the algorithm still does not work on matrices declared as sparse unless they are converted to full. Next steps involve debugging this.
Thoughts and Questions About Julia

As stated previously, the spchol() function presented in this report currently treats all of the matrices and vectors as full. For example, the function tril() is called by spchol():
    function tril(B)
        c = zeros(size(B)[1], size(B)[1])
        for i = 1:size(B)[1]
            c[i:size(B)[1], i] = B[i:size(B)[1], i]
        end
        return c
    end
Clearly this function treats B and c as dense matrices. I believe this raises several questions. First, if sparse matrix functions are implemented in Julia, should they be designed to work for both dense and sparse matrices? For example, if tril() assumed B was sparse and accessed B.nzval, then it could never work for dense matrices. Another question this raises is: what would cause a function designed for dense matrices to fail on a sparse matrix? If I run the spchol() function on a matrix that is declared as sparse, I get incorrect results or errors. It only works if I make the sparse matrix full. This is a bit problematic and should be addressed in the future. While attempting to debug this, I found an interesting result:
    julia> A
    10-by-10 sparse matrix with 21 nonzeros:
        [1,  1]  =  6.0
        [2,  1]  =  9.0
        [3,  1]  = 11.0
        [7,  1]  =  1.0
        [9,  1]  =  2.0
        [2,  2]  =  1.0
        [6,  2]  = 15.0
        [3,  3]  =  1.0
        [6,  3]  =  6.0
        [10, 3]  =  5.0
        [4,  4]  =  1.0
        [8,  4]  =  3.0
        [5,  5]  =  1.0
        [6,  5]  =  6.0
        [6,  6]  =  9.0
        [7,  7]  =  1.0
        [8,  7]  = 12.0
        [8,  8]  =  3.0
        [9,  8]  =  1.0
        [9,  9]  =  9.0
        [10, 10] =  1.0

    julia> full(A*A')
    10x10 Float64 Array:
     36.0  54.0   66.0  0.0  0.0   0.0  6.0  0.0  12.0   0.0
     54.0  82.0   99.0  0.0  0.0  15.0  9.0  0.0  18.0   0.0
     66.0  99.0  122.0  0.0  0.0   6.0  0.0  0.0   0.0   0.0
      0.0   0.0    0.0  1.0  0.0   0.0  0.0  0.0   0.0   0.0
      0.0   0.0    0.0  1.0  0.0   0.0  0.0  0.0   0.0   0.0
     15.0   0.0    0.0  0.0  0.0   0.0  0.0  0.0   0.0   6.0
      0.0   0.0    0.0  0.0  0.0   0.0  0.0  0.0   0.0   0.0
      0.0   0.0    3.0  0.0  0.0   0.0  0.0  0.0   0.0  12.0
      0.0   0.0    0.0  0.0  0.0   0.0  0.0  0.0   0.0   0.0
      0.0   5.0    0.0  0.0  0.0   0.0  0.0  0.0   0.0   0.0

    julia> full(A)*full(A)'
    10x10 Float64 Array:
     36.0  54.0   66.0  0.0  0.0    0.0   6.0    0.0  12.0   0.0
     54.0  82.0   99.0  0.0  0.0   15.0   9.0    0.0  18.0   0.0
     66.0  99.0  122.0  0.0  0.0    6.0  11.0    0.0  22.0   5.0
      0.0   0.0    0.0  1.0  0.0    0.0   0.0    3.0   0.0   0.0
      0.0   0.0    0.0  0.0  1.0    6.0   0.0    0.0   0.0   0.0
      0.0  15.0    6.0  0.0  6.0  378.0   0.0    0.0   0.0  30.0
      6.0   9.0   11.0  0.0  0.0    0.0   2.0   12.0   2.0   0.0
      0.0   0.0    0.0  3.0  0.0    0.0  12.0  162.0   3.0   0.0
     12.0  18.0   22.0  0.0  0.0    0.0   2.0    3.0  86.0   0.0
      0.0   0.0    5.0  0.0  0.0    0.0   0.0    0.0   0.0  26.0
Upon inspection, full(A*A') is not equal to full(A)*full(A)'. I suspect that these should be equal, but something seems to be causing a problem.
Conclusion

The function spchol() is presented in this report. As stated in the previous section, some work and testing still needs to be done. I would be happy to contribute in any way possible.
References

[1] John Gilbert's slides from Day 1 of Sparse Matrix Days at MIT. Available at http://www.cs.ucsb.edu/~gilbert/talks/talks.htm.

[2] Liu, Joseph W. H. "The Multifrontal Method for Sparse Matrix Solution: Theory and Practice." SIAM Review, Vol. 34, No. 1 (Mar. 1992), pp. 82-109.
Acknowledgements

I would like to thank Alan Edelman, Jeff Bezanson, and Viral Shah. Additionally, I would like to thank everyone who has been developing the Julia language.
Appendix A: Running the Code

All of the code is found in spcholp.j and must be loaded with @everywhere load("spcholp.j"). To find the Cholesky decomposition of A, run spchol(A).
Appendix B: The Code for spcholp.j
    function spones(A)
        A = sparse(A)
        A.nzval = ones(length(A.nzval),)
        return full(A)
    end

    function tril(B)
        c = zeros(size(B)[1], size(B)[1])
        for i = 1:size(B)[1]
            c[i:size(B)[1], i] = B[i:size(B)[1], i]
        end
        return c
    end

    @everywhere function pfillz(b, i)
        r = []
        for j = 1:(length(b))
            if b[j] == 1
                for k = 1:(length(b))
                    if (j > i) && (k > j) && b[k] == 1
                        r = vcat(r, [k, j])
                    end
                end
            end
        end
        return r
    end

    function fillinz(L)
        L = tril(L)
        L = spones(L)
        refs = [remote_call((mod(i, nprocs())) + 1, pfillz, L[:,i], i) | i = 1:size(L)[1]]
        for i = 1:size(L)[1]
            q = fetch(refs[i])
            for j = 1:length(q)/2
                L[q[2*j-1], q[2*j]] = 1
            end
        end
        return L
    end

    function nzindex(g)
        b = []
        for i = 1:length(g)
            if g[i] != 0
                b = vcat(b, [i])
            end
        end
        return b
    end

    function pary(L)
        u = size(L)
        par = zeros(u[1]-1, 1)
        for m = 1:(u[1]-1)
            dad = nzindex(L[:,m])
            if size(dad)[1] == 1
                par[m] = length(L[:,m])
            else
                par[m] = dad[2]
            end
        end
        return par
    end

    function kiddies(par)
        dim = length(par) + 1
        kids = zeros(dim, dim)
        for i = 1:length(par)
            kids[i, par[i]] = i
        end
        for j = 1:size(kids)[1]
            p = nzindex(kids[:,j])
            for k = 1:length(p)
                kids[:,j] = kids[:,j] + kids[:,p[k]]
            end
        end
        return kids
    end

    @everywhere function frontconstruct(A, L, colj, kids)
        col = L[:,colj]
        nzs = nzindex(col)
        Uj = zeros(length(nzs), length(nzs))
        if size(kids)[1] != 0
            addz = @parallel (+) for i = 1:size(kids)[1]
                -L[nzs, convert(Int16, kids[i])] * L[nzs, convert(Int16, kids[i])]'
            end
        else
            addz = zeros(length(nzs), length(nzs))
        end
        Fj = zeros(length(nzs), length(nzs))
        Fj[:,1] = A[nzs, colj]
        Fj[1,:] = A[colj, nzs]
        F = Fj + Uj + addz
        alpha = sqrt(F[1,1])
        r = F[:,1]
        lz = vcat(alpha, (1/alpha)*r[2:length(r)])
        return lz
    end

    @everywhere function spcholhelp(A, L, i, kids)
        kidset = nonzeros(kids[:,i])
        if length(kidset) == 0
            kidset = []
        end
        lz = frontconstruct(A, L, i, kidset)
        for j = 1:size(A)[1]
            if L[j,i] == 1
                L[j,i] = lz[1]
                lz = lz[2:length(lz)]
            end
        end
        return L[:,i]
    end

    function spchol(A)
        B = A
        L = fillinz(B)
        par = pary(L)
        kids = kiddies(par)
        tree = zeros(size(kids)[1])
        for i = 1:size(kids)[1]
            tree[i] = length(nonzeros(kids[:,i]))
        end
        for i = 1:size(kids)[1]
            k = []
            for j = 1:size(kids)[1]
                if tree[j] == i-1
                    k = vcat(k, [j])
                end
            end
            refs = [remote_call(mod(i, nprocs()) + 1, spcholhelp, A, L, k[i], kids) | i = 1:length(k)]
            for m = 1:length(refs)
                lcol = fetch(refs[m])
                L[:, k[m]] = lcol
            end
        end
        return L
    end