Parallel Cholesky Decomposition in Julia (6.338 Project)

Omar Mysore

December 16, 2011

Project Introduction
Sparse matrices dominate numerical simulation and technical computing. Their applications are practically limitless, ranging from solving partial differential equations to convex optimization to big data. Because of the wide use of sparse matrices, the objective of this project was to implement sparse Cholesky decomposition in parallel using the Julia language. In addition to developing the Cholesky decomposition feature in the Julia language and investigating the effects of parallelization, this project served the purpose of aiding the development of sparse matrix support in Julia.

A working sparse parallel Cholesky decomposition solver was developed. Although the current implementation has limitations in terms of speed and capability, it is hoped that it can serve as a basis for further development. The remainder of this report will discuss the basics of Cholesky decomposition and the algorithm used, the opportunities and methods of parallelization, the results, and the potential for future work.
Cholesky Decomposition

The Cholesky decomposition of a symmetric positive definite matrix A determines the lower-triangular matrix L, where LL' = A. Although it is limited to symmetric positive definite matrices, these matrices often appear in fields such as convex optimization. The following equations are taken from [2].

The basic dense Cholesky decomposition algorithm consists of repeatedly performing the factorization below on a matrix A of size n, where d is a scalar and v is an (n-1)-by-1 column vector:

    A = [ d   v' ]  =  [ sqrt(d)    0 ] [ 1   0          ] [ sqrt(d)   v'/sqrt(d) ]
        [ v   C  ]     [ v/sqrt(d)  I ] [ 0   C - vv'/d  ] [ 0         I          ]

Once this factorization is completed, the first column of L is determined. Then the same factorization is performed on C - vv'/d, and the process is repeated until all of the columns of L are found. The same process can also be carried out blockwise, with d replaced by a leading diagonal block and v by a block of columns.
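For illustration, the repeated factorization described above can be sketched as follows. This is a sketch in Python/NumPy rather than the report's own code (the 2011-era Julia dialect used here no longer runs on modern Julia), and `cholesky_outer` is an illustrative name, not part of the project:

```python
import numpy as np

def cholesky_outer(A):
    """Dense Cholesky via the outer-product recursion: returns lower L with L L' = A."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        d = A[j, j]                      # the scalar d
        v = A[j + 1:, j]                 # the (n-1)-by-1 column v
        L[j, j] = np.sqrt(d)
        L[j + 1:, j] = v / np.sqrt(d)    # first column of L at this step
        # recurse on the trailing submatrix C - v v' / d
        A[j + 1:, j + 1:] -= np.outer(v, v) / d
    return L

A = np.array([[4.0, 2.0], [2.0, 10.0]])
L = cholesky_outer(A)   # L @ L.T reproduces A
```

Each pass of the loop performs one instance of the factorization above: it emits one column of L and replaces the trailing block by C - vv'/d.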
Cholesky Decomposition for Sparse Matrices

For sparse matrices, several additional steps are taken in order to take advantage of the substantially fewer nonzero values. First, the fill-ins and the structure of L are determined without calculating the values. Next, the tree of dependencies is found, and finally the values of L are calculated. For a detailed explanation of the method summarized in this report, see [2].
The first step in sparse Cholesky decomposition is to determine the structure of L without explicitly computing L. All of the following images are taken from John Gilbert's slides [1]. Suppose A has the structure below:

Then the goal is to determine the structure of L, which is shown below:

The red dots are known as the fill-ins, since these values are nonzero in L but zero in A. Although the structure of L is now known, none of the specific values are. In order to determine the structure of L, the graphs of the matrices are used. Below is the graph of the previously shown matrix A:
From this graph, the graph of L can easily be determined by connecting the higher-numbered neighbors of each node/column in the graph. Below is the graph of L with the fill-ins in red:
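The symbolic step just described can be sketched in Python (illustrative names, not the project's code): eliminating each node in order connects its higher-numbered neighbors, and every edge created this way is a fill-in of L.

```python
def symbolic_fill(edges, n):
    """Given the undirected graph of a symmetric matrix A on nodes 0..n-1,
    return the edge set of the graph of L (original edges plus fill-ins)."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    # eliminate nodes in order; node v links together its higher-numbered
    # neighbors, and fill edges themselves create further fill later on
    for v in range(n):
        higher = sorted(u for u in adj[v] if u > v)
        for a in range(len(higher)):
            for b in range(a + 1, len(higher)):
                i, j = higher[a], higher[b]
                if j not in adj[i]:          # a new edge: a fill-in
                    adj[i].add(j)
                    adj[j].add(i)
    return {(min(i, j), max(i, j)) for i in adj for j in adj[i]}
```

For example, on the graph with edges (0,2), (0,3), (1,3), eliminating node 0 connects its higher neighbors 2 and 3, producing the single fill edge (2,3).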
Once the structure of L is determined, the dependency tree can be computed. To build the dependency tree, the parent of each node must be found: the parent of a given node/column j is the smallest row index i > j for which L[i,j] is nonzero, i.e., the first nonzero below the diagonal in column j of L. For example, for the graph of L previously shown, the dependency tree is shown below:
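The parent rule above can be sketched in Python (an illustrative sketch; `dependency_tree` is not part of the project's code):

```python
def dependency_tree(L_structure):
    """L_structure[i][j] is True where L is nonzero. Returns parent[j],
    or None for columns with no nonzero below the diagonal (e.g. the root)."""
    n = len(L_structure)
    parent = [None] * n
    for j in range(n):
        below = [i for i in range(j + 1, n) if L_structure[i][j]]
        if below:
            parent[j] = min(below)   # first nonzero row below the diagonal
    return parent
```

For a 3-by-3 L with nonzeros at (0,0), (1,1), (2,0), (2,1), (2,2), both columns 0 and 1 have their first subdiagonal nonzero in row 2, so node 2 is the parent of both.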
After the dependency tree is determined, the values of L can be calculated. The basic procedure for determining each column of L is as follows:

    For each column j in A:
        Determine the frontal matrix Fj, assembled from the entries of A in the rows of
        the nonzero structure of column j, together with the update contribution -Lc*Lc'
        from each child column c.
        The column is then Lj = vcat(sqrt(Fj[1,1]), Fj[2:end,1]/sqrt(Fj[1,1])).

Here, T[j] - {j} represents all of the children nodes, which must already have been computed in order to calculate Lj. This is the basic formulation for determining the sparse Cholesky decomposition. All equations and images used in this and the previous section were obtained from [1] and [2]. For more details, see [2].
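The per-column formula can be sketched as follows. This is a dense, left-looking Python variant rather than the report's frontal-matrix code, but each column j is likewise assembled from the entries of A minus the -Lk*Lk' contributions of already-computed columns, then scaled by the square root of its pivot (`chol_columns` is an illustrative name):

```python
import numpy as np

def chol_columns(A):
    """Column-by-column Cholesky: assemble each column from A and the
    contributions of earlier columns, then scale by sqrt of the pivot."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        col = A[j:, j].copy()
        for k in range(j):
            if L[j, k] != 0.0:               # only columns that touch row j contribute
                col -= L[j, k] * L[j:, k]    # subtract the update from column k
        L[j, j] = np.sqrt(col[0])            # pivot
        L[j + 1:, j] = col[1:] / L[j, j]     # scale the rest of the column
    return L
```

In the sparse setting only the rows in the nonzero structure of column j need to be assembled, which is what the frontal matrix Fj captures.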
Parallelization and Implementation in Julia

For sparse Cholesky decomposition, there are three primary opportunities for parallelization: first, in determining the structure of L; second, in simultaneously computing columns that are independent of one another; and third, in turning the addition step for each column into a parallel reduction.
In order to determine the structure of L, each column or block of columns can be sent to a different processor, and the necessary fill-ins can be determined. Once the fill-ins are determined, they can be added to L. The following Julia function demonstrates this:
    function fillinz(L)
        L = tril(L)
        L = spones(L)
        refs = [remote_call((mod(i, nprocs())) + 1, pfillz, L[:,i], i) | i = 1:size(L)[1]]
        for i = 1:size(L)[1]
            q = fetch(refs[i])
            for j = 1:length(q)/2
                L[q[2*j-1], q[2*j]] = 1
            end
        end
        return L
    end
The input to this function is the matrix A for which we would like the Cholesky decomposition. The comprehension refs sends each column to a different processor, and the for-loop collects the results and adds the fill-ins.
The tree structure of dependencies allows for further parallelization. As previously discussed, for a matrix A the dependency tree might look like the following figure:

In this case, columns 1, 2, 4, and 6 of L can all be determined without any other columns of L, so they can be computed simultaneously. The following for-loop, which is part of the main sparse Cholesky function, performs this process of going through the levels of the tree and sending all of the columns at the same level to different processors:
    for i = 1:size(kids)[1]
        k = []
        for j = 1:size(kids)[1]
            if tree[j] == i-1
                k = vcat(k, [j])
            end
        end
        refs = [remote_call(mod(i, nprocs()) + 1, spcholhelp, A, L, k[i], kids) | i = 1:length(k)]
        for m = 1:length(refs)
            lcol = fetch(refs[m])
            L[:, k[m]] = lcol
        end
    end
In this for-loop, the index i loops through all of the possible values for the number of children a column can have. The vector refs holds all of the columns of L that are calculated in parallel for a given level of the tree.
The final level of parallelization is within the function that determines the values of the columns of L. During the calculation, a number of matrices equal to the number of children of the given column must be added together. Rather than adding them serially, this is done with a parallel for-loop, shown below:
    addz = @parallel (+) for i = 1:size(kids)[1]
        -L[nzs, convert(Int16, kids[i])] * L[nzs, convert(Int16, kids[i])]'
    end
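The same parallel reduction can be sketched in Python (illustrative only; the report uses Julia's @parallel (+) macro): each child's -Lc*Lc' contribution is computed on a worker and the partial results are summed.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_update(L, nzs, kid_cols):
    """Sum of -L[nzs, c] * L[nzs, c]' over the child columns c, computed
    as a parallel map followed by a reduction with +."""
    def contribution(c):
        v = L[np.ix_(nzs, [c])]          # the child column restricted to rows nzs
        return -v @ v.T                  # one child's update matrix
    with ThreadPoolExecutor() as pool:
        parts = pool.map(contribution, kid_cols)
    return sum(parts)                    # the reduction step
```

Because matrix addition is associative, the partial updates can be combined in any order, which is what makes this step safe to parallelize.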
Results and Discussion

The primary objective of the project was to write a function that performs Cholesky decomposition on a sparse matrix. This function is called spchol(), and its input argument is the matrix. The function appears to work for all matrices tested. A very simple example is shown:
    julia> A
    10x10 Float64 Array:
     121.0   33.0  0.0  0.0   33.0    0.0   55.0   0.0   44.0   0.0
      33.0   10.0  0.0  0.0   12.0    0.0   17.0   2.0   12.0   0.0
       0.0    0.0  1.0  0.0    0.0    9.0    0.0   0.0    0.0   0.0
       0.0    0.0  0.0  1.0    8.0    0.0    0.0   0.0    0.0   0.0
      33.0   12.0  0.0  8.0   83.0    0.0   23.0   6.0   12.0   2.0
       0.0    0.0  9.0  0.0    0.0  162.0   18.0  63.0    0.0   0.0
      55.0   17.0  0.0  0.0   23.0   18.0   38.0  18.0   20.0   4.0
       0.0    2.0  0.0  0.0    6.0   63.0   18.0  54.0    0.0   3.0
      44.0   12.0  0.0  0.0   12.0    0.0   20.0   0.0   17.0   0.0
       0.0    0.0  0.0  0.0    2.0    0.0    4.0   3.0    0.0  14.0

    julia> u = spchol(A)
    10x10 Float64 Array:
     11.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
      3.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
      0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
      0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
      3.0  3.0  0.0  8.0  1.0  0.0  0.0  0.0  0.0  0.0
      0.0  0.0  9.0  0.0  0.0  9.0  0.0  0.0  0.0  0.0
      5.0  2.0  0.0  0.0  2.0  2.0  1.0  0.0  0.0  0.0
      0.0  2.0  0.0  0.0  0.0  7.0  0.0  1.0  0.0  0.0
      4.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
      0.0  0.0  0.0  0.0  2.0  0.0  0.0  3.0  0.0  1.0

    julia> u*u'
    10x10 Float64 Array:
     121.0   33.0  0.0  0.0   33.0    0.0   55.0   0.0   44.0   0.0
      33.0   10.0  0.0  0.0   12.0    0.0   17.0   2.0   12.0   0.0
       0.0    0.0  1.0  0.0    0.0    9.0    0.0   0.0    0.0   0.0
       0.0    0.0  0.0  1.0    8.0    0.0    0.0   0.0    0.0   0.0
      33.0   12.0  0.0  8.0   83.0    0.0   23.0   6.0   12.0   2.0
       0.0    0.0  9.0  0.0    0.0  162.0   18.0  63.0    0.0   0.0
      55.0   17.0  0.0  0.0   23.0   18.0   38.0  18.0   20.0   4.0
       0.0    2.0  0.0  0.0    6.0   63.0   18.0  54.0    0.0   3.0
      44.0   12.0  0.0  0.0   12.0    0.0   20.0   0.0   17.0   0.0
       0.0    0.0  0.0  0.0    2.0    0.0    4.0   3.0    0.0  14.0

Above, A and u*u' are identical.
Tests were conducted to see the effects of parallelization. All matrices were generated in the same manner: first declaring A = tril(round(10*sprand(n,n,.3)) + eye(n)); and then A = A*A'. The results are shown in the table below.
    Size      Time on 1 processor   Time on 4 processors   Parallel-to-serial ratio
    10x10     0.0025 s              0.189 s                76
    50x50     0.101 s               0.52 s                 5
    100x100   2.8 s                 2.8 s                  1
    150x150   24.5 s                22.1 s                 0.9
For smaller matrices, serial appears to be better, because the communication between processors dominates. As the size of the matrices increases, the parallel version improves substantially relative to serial. More tests need to be conducted in order to understand the behavior of this function.
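The test-matrix construction described above can be sketched in Python/NumPy (an illustrative analogue of the Julia expression, not the code used for the timings): B*B' with a nonsingular lower-triangular B is always symmetric positive definite, so spchol() receives a valid input.

```python
import numpy as np

def make_test_matrix(n, density=0.3, seed=0):
    """Analogue of A = tril(round(10*sprand(n,n,.3)) + eye(n)); A = A*A'."""
    rng = np.random.default_rng(seed)
    mask = rng.random((n, n)) < density          # sparsity pattern, ~30% nonzero
    B = np.round(10 * rng.random((n, n)) * mask)
    B = np.tril(B) + np.eye(n)                   # lower triangular, diagonal >= 1
    return B @ B.T                               # symmetric positive definite

A = make_test_matrix(10)
```

Adding the identity before forming B*B' guarantees a positive diagonal on the triangular factor, so the product is positive definite regardless of the random pattern.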
Limitations and Further Work

As stated previously, more tests need to be conducted in order to understand the performance and limitations of the spchol() function. Additionally, a major limitation is the fact that although the algorithm is designed for sparse matrices, it currently only works on sparse matrices stored in full (dense) form (and, of course, on full matrices). This is also a possible cause of the large growth in runtime with matrix dimension shown in the results. The algorithm was initially developed this way because of errors involving indexing columns of matrices that were declared as sparse. While these errors were recently fixed, the algorithm still does not work on matrices declared as sparse unless they are converted to full. Next steps involve debugging this.
Thoughts and Questions About Julia

As stated previously, the spchol() function presented in this report currently treats all of the matrices and vectors as full. For example, the function tril() is called by spchol():
    function tril(B)
        c = zeros(size(B)[1], size(B)[1])
        for i = 1:size(B)[1]
            c[i:size(B)[1], i] = B[i:size(B)[1], i]
        end
        return c
    end
Clearly this function treats B and c as dense matrices. I believe this raises several questions. First, if sparse matrix functions are implemented in Julia, should they be designed to work for both dense and sparse matrices? For example, if tril() assumed B was sparse and accessed B.nzval, then it could never work for dense matrices. Another question this raises is: what would cause a function designed for dense matrices to fail on a sparse matrix? If I run the spchol() function on a matrix that is declared as sparse, I get incorrect results or errors. It only works if I make the sparse matrix full. This is a bit problematic and should be addressed in the future. While attempting to debug this, I found an interesting result:
    julia> A
    10-by-10 sparse matrix with 21 nonzeros:
        [1,  1]  =  6.0
        [2,  1]  =  9.0
        [3,  1]  = 11.0
        [7,  1]  =  1.0
        [9,  1]  =  2.0
        [2,  2]  =  1.0
        [6,  2]  = 15.0
        [3,  3]  =  1.0
        [6,  3]  =  6.0
        [10, 3]  =  5.0
        [4,  4]  =  1.0
        [8,  4]  =  3.0
        [5,  5]  =  1.0
        [6,  5]  =  6.0
        [6,  6]  =  9.0
        [7,  7]  =  1.0
        [8,  7]  = 12.0
        [8,  8]  =  3.0
        [9,  8]  =  1.0
        [9,  9]  =  9.0
        [10, 10] =  1.0

    julia> full(A*A')
    10x10 Float64 Array:
     36.0  54.0   66.0  0.0  0.0   0.0  6.0  0.0  12.0   0.0
     54.0  82.0   99.0  0.0  0.0  15.0  9.0  0.0  18.0   0.0
     66.0  99.0  122.0  0.0  0.0   6.0  0.0  0.0   0.0   0.0
      0.0   0.0    0.0  1.0  0.0   0.0  0.0  0.0   0.0   0.0
      0.0   0.0    0.0  1.0  0.0   0.0  0.0  0.0   0.0   0.0
     15.0   0.0    0.0  0.0  0.0   0.0  0.0  0.0   0.0   6.0
      0.0   0.0    0.0  0.0  0.0   0.0  0.0  0.0   0.0   0.0
      0.0   0.0    3.0  0.0  0.0   0.0  0.0  0.0   0.0  12.0
      0.0   0.0    0.0  0.0  0.0   0.0  0.0  0.0   0.0   0.0
      0.0   5.0    0.0  0.0  0.0   0.0  0.0  0.0   0.0   0.0

    julia> full(A)*full(A)'
    10x10 Float64 Array:
     36.0  54.0   66.0  0.0  0.0    0.0   6.0    0.0  12.0   0.0
     54.0  82.0   99.0  0.0  0.0   15.0   9.0    0.0  18.0   0.0
     66.0  99.0  122.0  0.0  0.0    6.0  11.0    0.0  22.0   5.0
      0.0   0.0    0.0  1.0  0.0    0.0   0.0    3.0   0.0   0.0
      0.0   0.0    0.0  0.0  1.0    6.0   0.0    0.0   0.0   0.0
      0.0  15.0    6.0  0.0  6.0  378.0   0.0    0.0   0.0  30.0
      6.0   9.0   11.0  0.0  0.0    0.0   2.0   12.0   2.0   0.0
      0.0   0.0    0.0  3.0  0.0    0.0  12.0  162.0   3.0   0.0
     12.0  18.0   22.0  0.0  0.0    0.0   2.0    3.0  86.0   0.0
      0.0   0.0    5.0  0.0  0.0    0.0   0.0    0.0   0.0  26.0
Upon inspection, full(A*A') is not equal to full(A)*full(A)'. I suspect that these should be equal, but something seems to be causing a problem.
Conclusion

The function spchol() is presented in this report. As stated in the previous section, some work and testing still needs to be done. I would be happy to contribute in any way possible.
References

[1] John Gilbert's slides from Day 1 of Sparse Matrix Days at MIT. Available at http://www.cs.ucsb.edu/~gilbert/talks/talks.htm.

[2] Liu, Joseph W. H. "The Multifrontal Method for Sparse Matrix Solution: Theory and Practice." SIAM Review, Vol. 34, No. 1 (Mar. 1992), pp. 82-109.
Acknowledgements

I would like to thank Alan Edelman, Jeff Bezanson, and Viral Shah. Additionally, I would like to thank everyone who has been developing the Julia language.
Appendix A: Running the Code

All of the code is found in spcholp.j and must be loaded with @everywhere load("spcholp.j"). To find the Cholesky decomposition of A, run spchol(A).
Appendix B: The Code for spcholp.j
    function spones(A)
        A = sparse(A)
        A.nzval = ones(length(A.nzval),)
        return full(A)
    end

    function tril(B)
        c = zeros(size(B)[1], size(B)[1])
        for i = 1:size(B)[1]
            c[i:size(B)[1], i] = B[i:size(B)[1], i]
        end
        return c
    end

    @everywhere function pfillz(b, i)
        r = []
        for j = 1:(length(b))
            if b[j] == 1
                for k = 1:(length(b))
                    if (j > i) && (k > j) && b[k] == 1
                        r = vcat(r, [k, j])
                    end
                end
            end
        end
        return r
    end

    function fillinz(L)
        L = tril(L)
        L = spones(L)
        refs = [remote_call((mod(i, nprocs())) + 1, pfillz, L[:,i], i) | i = 1:size(L)[1]]
        for i = 1:size(L)[1]
            q = fetch(refs[i])
            for j = 1:length(q)/2
                L[q[2*j-1], q[2*j]] = 1
            end
        end
        return L
    end

    function nzindex(g)
        b = []
        for i = 1:length(g)
            if g[i] != 0
                b = vcat(b, [i])
            end
        end
        return b
    end

    function pary(L)
        u = size(L)
        par = zeros(u[1]-1, 1)
        for m = 1:(u[1]-1)
            dad = nzindex(L[:,m])
            if size(dad)[1] == 1
                par[m] = length(L[:,m])
            else
                par[m] = dad[2]
            end
        end
        return par
    end

    function kiddies(par)
        dim = length(par) + 1
        kids = zeros(dim, dim)
        for i = 1:length(par)
            kids[i, par[i]] = i
        end
        for j = 1:size(kids)[1]
            p = nzindex(kids[:,j])
            for k = 1:length(p)
                kids[:,j] = kids[:,j] + kids[:,p[k]]
            end
        end
        return kids
    end

    @everywhere function frontconstruct(A, L, colj, kids)
        col = L[:,colj]
        nzs = nzindex(col)
        Uj = zeros(length(nzs), length(nzs))
        if size(kids)[1] != 0
            addz = @parallel (+) for i = 1:size(kids)[1]
                -L[nzs, convert(Int16, kids[i])] * L[nzs, convert(Int16, kids[i])]'
            end
        else
            addz = zeros(length(nzs), length(nzs))
        end
        Fj = zeros(length(nzs), length(nzs))
        Fj[:,1] = A[nzs, colj]
        Fj[1,:] = A[colj, nzs]
        F = Fj + Uj + addz
        alpha = sqrt(F[1,1])
        r = F[:,1]
        lz = vcat(alpha, (1/alpha)*r[2:length(r)])
        return lz
    end

    @everywhere function spcholhelp(A, L, i, kids)
        kidset = nonzeros(kids[:,i])
        if length(kidset) == 0
            kidset = []
        end
        lz = frontconstruct(A, L, i, kidset)
        for j = 1:size(A)[1]
            if L[j,i] == 1
                L[j,i] = lz[1]
                lz = lz[2:length(lz)]
            end
        end
        return L[:,i]
    end

    function spchol(A)
        B = A
        L = fillinz(B)
        par = pary(L)
        kids = kiddies(par)
        tree = zeros(size(kids)[1])
        for i = 1:size(kids)[1]
            tree[i] = length(nonzeros(kids[:,i]))
        end
        for i = 1:size(kids)[1]
            k = []
            for j = 1:size(kids)[1]
                if tree[j] == i-1
                    k = vcat(k, [j])
                end
            end
            refs = [remote_call(mod(i, nprocs()) + 1, spcholhelp, A, L, k[i], kids) | i = 1:length(k)]
            for m = 1:length(refs)
                lcol = fetch(refs[m])
                L[:, k[m]] = lcol
            end
        end
        return L
    end