recent progress in autotuning at berkeley
TRANSCRIPT
Recent Progress in Autotuning at Berkeley
Jim Demmel, UC Berkeley
bebop.cs.berkeley.edu / parlab.eecs.berkeley.edu
Outline
• Introduction to ParLab
• Summary of autotuning activities
  – Also Shoaib Kamil's talk on Thursday @ 10:15
• Dense Linear Algebra
• Sparse Linear Algebra (summary)
• Communication collectives
• Responses to questions from the organizers
Overview of the Parallel Laboratory
Krste Asanovic, Ras Bodik, Jim Demmel, Tom Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, Kathy Yelick
parlab.eecs.berkeley.edu
How do compelling apps relate to 13 motifs?
[Chart: "Motif" popularity across application areas, shaded red (hot) to blue (cold)]
ParLab Research Overview: Easy to write correct programs that run efficiently on manycore
[Diagram of the ParLab stack:
 Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser; Motifs
 Productivity layer: Composition & Coordination Language (C&CL), C&CL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, Sketching, Static Verification
 Efficiency layer: Efficiency Languages, Efficiency Language Compilers, Type Systems, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives, Dynamic Checking, Directed Testing, Debugging with Replay
 OS/Architecture: Legacy OS, OS Libraries & Services, Hypervisor, Multicore/GPGPU, RAMP Manycore
 Cross-cutting concerns: Correctness; Diagnosing Power/Performance]
Summary of Autotuning (1/2)
• Dense linear algebra
  – New algorithms that attain lower bounds on communication (parallel and sequential)
  – New algorithms for GPUs
• Sparse linear algebra (summary)
  – New algorithms that attain lower bounds on communication
• Collective communications
  – Choosing right tree for reductions/broadcasts/etc.
Summary of Autotuning (2/2)
• To be presented by Shoaib Kamil on Thursday
• Recent work autotuning three parallel kernels
  – Sparse Matrix-Vector Multiply (SpMV)
  – Lattice Boltzmann MHD
  – Stencils (Heat Equation)
• Lessons learned / commonalities between the autotuners
• Towards a framework for building autotuners
  – What is the role of the compiler?
Motivation
• Running time of an algorithm is the sum of 3 terms:
  – #flops * time_per_flop
  – #words moved / bandwidth
  – #messages * latency
  – (the last two terms are the cost of communication)
• Exponentially growing gaps between:
  – time_per_flop << 1/network bandwidth << network latency
    • Improving 59%/year vs 26%/year vs 15%/year
  – time_per_flop << 1/memory bandwidth << memory latency
    • Improving 59%/year vs 23%/year vs 5.5%/year
• Goal: reorganize/tune motifs to avoid communication
  – Not just hiding communication (speedup ≤ 2x)
  – Arbitrary speedups possible for linear algebra
  – Success metric: attain lower bounds on communication
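The three-term cost model above can be sketched directly in code. The machine numbers below are illustrative assumptions, not measurements from the talk:

```python
# Sketch of the slide's runtime model:
# T = #flops * time_per_flop + #words moved / bandwidth + #messages * latency.

def modeled_runtime(flops, words, messages,
                    time_per_flop=1e-9,   # 1 Gflop/s per core (assumed)
                    bandwidth=4e9,        # 4e9 words/s (assumed)
                    latency=1e-6):        # 1 microsecond per message (assumed)
    """Sum of the compute, bandwidth, and latency terms."""
    return flops * time_per_flop + words / bandwidth + messages * latency

# Latency dominates when the same data moves in many small messages:
t_many = modeled_runtime(flops=1e9, words=1e8, messages=100_000)
t_few = modeled_runtime(flops=1e9, words=1e8, messages=10)
assert t_few < t_many
```

This is why the new algorithms target the #messages and #words terms, not just #flops.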
Linear Algebra Collaborators (so far)
• UC Berkeley
  – Kathy Yelick, Ming Gu
  – Mark Hoemmen, Marghoob Mohiyuddin, Kaushik Datta, George Petropoulos, Sam Williams, BeBOP group
  – Lenny Oliker, John Shalf
  – ParLab/UPCRC – new parallel computing research center funded by Intel and Microsoft
• CU Denver
  – Julien Langou
• INRIA
  – Laura Grigori, Hua Xiang
• Georgia Tech
  – Rich Vuduc
• Dense work ⊆ ongoing development of Sca/LAPACK
  – Joint with UTK
Communication Lower Bounds for Dense Linear Algebra (1/2)
• Hong-Kung (1981), Irony/Tiskin/Toledo (2004)
• Usual matmul using 2n^3 flops
• Sequential case, with fast memory of size W < 3n^2
  – Thm: #words moved = Ω(n^3 / W^(1/2))
• Parallel case, P procs, O(n^2/P) memory/proc
  – Thm: #words moved = Ω(n^2 / P^(1/2))
• Bandwidth lower bound ⇒ latency lower bound
  – #messages ≥ #words moved / (fast, local) memory size
  – Sequential: #messages = Ω(n^3 / W^(3/2))
  – Parallel: #messages = Ω(P^(1/2))
Communication Lower Bounds for Dense Linear Algebra (2/2)
• Same lower bounds apply to LU and QR
  – Assumption: O(n^3) algorithms; LU is easy but QR is subtle
• LAPACK and ScaLAPACK do not attain these bounds
  – ScaLAPACK attains the bandwidth bound
  – But sends O((mn/P)^(1/2)) times more messages
  – LAPACK attains neither; O((m^2/W)^(1/2))x more words moved
• But new algorithms do attain them, mod polylog factors
  – Parallel QR: requires new representation of Q, redundant flops
  – Sequential QR: can use new one, or recursive
  – Parallel LU: new pivoting strategy needed, stable in practice
  – Sequential LU: can use new one or recursive, but higher latency
Minimizing Communication in Parallel QR
[Diagram: W = [W0; W1; W2; W3]; local factors R00, R10, R20, R30 combined pairwise into R01, R11, then into R02]
• QR decomposition of m x n matrix W, m >> n
• TSQR = "Tall Skinny QR"
• P processors, block row layout
• Usual parallel algorithm
  – Compute Householder vector for each column
  – Number of messages ∝ n log P
• Communication-avoiding algorithm
  – Reduction operation, with QR as operator
  – Number of messages ∝ log P – optimal
TSQR in more detail
Q is represented implicitly as a product (tree of factors)
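A minimal NumPy sketch of the TSQR reduction, assuming a binary tree (the talk's implementations use MPI and tuned BLAS; this only illustrates the R-factor reduction, not the implicit tree of Q factors):

```python
# TSQR sketch: each "processor" QR-factors its block row, then pairs of
# R factors are stacked and re-factored up a binary reduction tree.
import numpy as np

def tsqr_R(blocks):
    """Return the R factor of the stacked blocks via a binary reduction tree."""
    # Leaf step: local QR on each block row W_i -> R_i
    Rs = [np.linalg.qr(W, mode='r') for W in blocks]
    # Reduction: combine pairs [R_i; R_j] -> new R until one remains
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

rng = np.random.default_rng(0)
W = rng.standard_normal((4000, 50))   # tall skinny matrix, m >> n
blocks = np.vsplit(W, 4)              # block row layout, 4 "processors"
R_tree = tsqr_R(blocks)
R_ref = np.linalg.qr(W, mode='r')
# R is unique up to row signs, so compare magnitudes
assert np.allclose(np.abs(R_tree), np.abs(R_ref))
```

Each reduction round is one message per pair of processors, giving the log P message count on the slide.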
Minimizing Communication in TSQR
[Diagrams of reduction trees over W = [W0; W1; W2; W3]:
 Parallel: binary tree — R00, R10, R20, R30 → R01, R11 → R02
 Sequential: linear chain — R00 → R01 → R02 → R03
 Dual core: hybrid of the two]
Choose reduction tree dynamically
Multicore / multisocket / multirack / multisite / out-of-core: ?
Performance of TSQR vs Sca/LAPACK
• Parallel
  – Pentium III cluster, Dolphin interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Both use Elmroth-Gustavson locally – enabled by TSQR
• Sequential
  – OOC on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
• See UC Berkeley EECS Tech Report 2008-74
  – Being revised to add optimality results!
• General rectangular QR
  – TSQR for panel factorization, then right-looking
  – Implementation underway (Julien Langou)
Modeled Speedups of CAQR vs ScaLAPACK
• Petascale: up to 22.9x
• IBM Power 5: up to 9.7x
• "Grid": up to 11x
Petascale machine with 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s.
TSLU: LU factorization of a tall skinny matrix
First try the obvious generalization of TSQR:
Growth factor for TSLU-based factorization
[Plot: growth factor vs matrix size]
Unstable for large P and large matrices. When P = #rows, TSLU is equivalent to parallel pivoting.
Courtesy of H. Xiang
Making TSLU Stable
• At each node in tree, TSLU selects b pivot rows from 2b candidates from its 2 child nodes
• At each node, do LU on 2b original rows selected by child nodes, not U factors from child nodes
• When TSLU done, permute b selected rows to top of original matrix, redo b steps of LU without pivoting
• CALU – Communication-Avoiding LU for general A
  – Use TSLU for panel factorizations
  – Apply to rest of matrix
  – Cost: redundant panel factorizations
• Benefit:
  – Stable in practice, but not same pivot choice as GEPP
  – b times fewer messages overall – faster
Growth factor for better CALU approach
Like threshold pivoting with worst-case threshold = 0.33, so |L| <= 3. Testing shows about the same residual as GEPP.
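The tournament-pivoting idea behind TSLU can be sketched in NumPy. Everything here (function names, block counts) is illustrative, not the talk's implementation; the key point is that tree nodes re-run GEPP on the candidate *original* rows from their children:

```python
# TSLU tournament pivoting sketch: leaves pick b candidate pivot rows with
# ordinary GEPP on their block; internal nodes re-run GEPP on the 2b
# original rows chosen by their children. Assumes nblocks is a power of 2.
import numpy as np

def gepp_pivot_rows(block, rows, b):
    """Global indices of the first b pivot rows GEPP picks on `block`."""
    M = block.astype(float).copy()
    rows = list(rows)
    for k in range(min(b, M.shape[0])):
        p = k + int(np.argmax(np.abs(M[k:, k])))       # partial pivoting
        M[[k, p]] = M[[p, k]]
        rows[k], rows[p] = rows[p], rows[k]
        M[k+1:, k:] -= np.outer(M[k+1:, k] / M[k, k], M[k, k:])  # eliminate
    return rows[:b]

def tslu_pivot_rows(A, nblocks):
    """Tournament over a binary tree; b = panel width (#columns of A)."""
    b = A.shape[1]
    idx = np.split(np.arange(A.shape[0]), nblocks)
    cand = [gepp_pivot_rows(blk, ix, b)
            for blk, ix in zip(np.vsplit(A, nblocks), idx)]
    while len(cand) > 1:
        cand = [gepp_pivot_rows(A[cand[i] + cand[i+1]], cand[i] + cand[i+1], b)
                for i in range(0, len(cand), 2)]
    return cand[0]

rng = np.random.default_rng(1)
piv = tslu_pivot_rows(rng.standard_normal((256, 8)), nblocks=4)
assert len(set(piv)) == 8        # b distinct pivot rows selected
```

Each tree level exchanges only b candidate rows per pair of processors, which is where the "b times fewer messages" claim comes from.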
Performance vs ScaLAPACK
• TSLU
  – IBM Power 5
    • Up to 4.37x faster (16 procs, 1M x 150)
  – Cray XT4
    • Up to 5.52x faster (8 procs, 1M x 150)
• CALU
  – IBM Power 5
    • Up to 2.29x faster (64 procs, 1000 x 1000)
  – Cray XT4
    • Up to 1.81x faster (64 procs, 1000 x 1000)
• Optimality analysis analogous to QR
• See INRIA Tech Report 6523 (2008)
Speedup prediction for a Petascale machine – up to 81x faster, P = 8192
Petascale machine with 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s.
Related Work and Contributions for Dense Linear Algebra
• Related work
  – Pothen & Raghavan (1989)
  – Toledo (1997), Rabani & Toledo (2001)
  – Dunha, Becker & Patterson (2002)
  – Gunter & van de Geijn (2005)
• Our contributions
  – QR: 2D parallel, efficient 1D parallel QR
  – Unified approach to sequential, parallel, …
  – Parallel LU
  – Communication lower bounds, proving optimality
Heterogeneous (CPU + GPU) Performance Libraries
Vasily Volkov
Challenges in programming GPUs
• Relatively little tuning information for GPU hardware
  – Hardware exposed via abstract programming model
  – Hard to develop accurate performance models
  – Hardware changing rapidly
• We get ~2x speedups vs. best other work on a variety of libraries
  – Matrix-matrix multiply (1.6x speedup)
  – FFT (2x speedup)
  – LU/Cholesky factorization (3x/2x speedups)
  – Tridiagonal eigenvalue solver (2x speedup)
• Example: partial pivoting in LU factorization on NVIDIA GPU
  – Ad: "GPU supports gather/scatter with arbitrary access patterns"
  – LAPACK code (sgetrf) spends 50% of time doing pivoting
  – Gets worse with larger matrices
  – Only 1% of time is spent in pivoting if matrix is transposed
    • 2x speedup!
Matrix-matrix multiply (SGEMM)
• Uses software prefetching, strip-mining, register blocking
• Resembles algorithms used for Cray X1 and IBM 3090 VF
• 60% of peak, bound by access to on-chip shared memory
FFT, complex-to-complex, batched
• Results on NVIDIA GeForce 8800 GTX
• Resembles algorithms used for vector processors
• Radix-8 in registers, transposes using shared memory
[Plot: Gflop/s vs N (1 to 32768) for bench FFT, register storage, CUFFT 1.1, and our best implementation]
Heterogeneous Computing
• Question: should you port entire applications to GPU? Maybe not
• Example: eigenvalue solver (bisection, LAPACK's sstebz)
  – Most work (O(n^2)) in vectorized, embarrassingly parallel code
    • Do it on GPU and finely tune
    • May do many redundant flops on GPU to go faster (multisection)
    • Run on CPU when problem too small for GPU
    • Use performance model to choose CPU or GPU version
  – Little work (O(n)) in nearly serial code
    • Do it on CPU: time-consuming and non-trivial to port
• Independent project:
  – Did everything on the GPU
  – Used advanced parallel algorithms for O(n) work
  – Up to 2x slower running on faster GPU (compared to us)
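The "use a performance model to choose CPU or GPU" bullet can be sketched as a toy dispatcher. The rates and overheads below are made-up placeholders; a real tuner would measure them on the target machine:

```python
# Model-based CPU/GPU dispatch sketch: the GPU has a fixed startup cost but
# a much higher flop rate, so small problems belong on the CPU.
def choose_device(n, gpu_startup=1e-4, gpu_rate=1e11, cpu_rate=1e9):
    """Pick the device with the lower modeled time for O(n^2) work."""
    t_gpu = gpu_startup + n * n / gpu_rate   # launch overhead + fast flops
    t_cpu = n * n / cpu_rate                 # no startup, slower flops
    return 'gpu' if t_gpu < t_cpu else 'cpu'

assert choose_device(100) == 'cpu'       # too small: startup dominates
assert choose_device(10_000) == 'gpu'    # large: GPU flop rate wins
```

The crossover point is exactly the kind of platform-specific parameter an autotuner searches for.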
Breakdown of Runtime (sstebz)
• "Count(x)" is the O(n^2) work; use either SSE or GPU
• "The rest" is the O(n) work; isn't worth optimizing
LU Factorization
• O(n^3) work in bulky BLAS3 operations
• O(n^2) work in fine-grain BLAS1/BLAS2 (panel factorization)
• Panel factorization is usually faster on the CPU
  – Peak sustained bandwidth of 8800 GTX is 75 GB/s
  – Kernel launch overhead is 5 µs
  – BLAS1 and BLAS2 are bandwidth bound
  – Compare vs. CPU handicapped by CPU-GPU transfers
  – Faster on CPU up to n ~ 16K on NVIDIA 8800 GTX
Overall Performance
Overlap in LU Factorization
Overlap panel factorization on CPU with sgemm on GPU
LU slowdowns from omitting optimizations
Summary of GPU Experience
• Heterogeneity expands tuning space
  – Dividing work between CPU and GPU
    • Depends a lot on flop-rate/bandwidth/latency/startup
  – Different data layouts may be best for CPU and GPU
    • Rowwise vs columnwise for LU
  – May want to do (many) extra flops on GPU
    • Multisection vs bisection in eigensolver
    • TRINV + TRMM vs TRSM in LU
• Rapid evolution of GPU architectures motivates autotuning
Why all our problems are solved for dense linear algebra – in theory
• Thm (D., Dumitriu, Holtz, Kleinberg) (Numer. Math. 2007)
  – Given any matmul running in O(n^ω) ops for some ω > 2, it can be made stable and still run in O(n^(ω+ε)) ops, for any ε > 0.
• Current record: ω ≈ 2.38
• Thm (D., Dumitriu, Holtz) (Numer. Math. 2008)
  – Given any stable matmul running in O(n^(ω+ε)) ops, can do backward stable dense linear algebra in O(n^(ω+ε)) ops:
    • GEPP, QR (recursive)
    • rank-revealing QR (randomized)
    • (Generalized) Schur decomposition, SVD (randomized)
• Also reduces communication to O(n^(ω+ε))
• But constants?
Avoiding Communication in Sparse Linear Algebra – Summary
• Take k steps of Krylov subspace method
  – GMRES, CG, Lanczos, Arnoldi
  – Assume matrix "well-partitioned," with modest surface-to-volume ratio
  – Parallel implementation
    • Conventional: O(k log p) messages
    • "New": O(log p) messages – optimal
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • "New": O(1) moves of data – optimal
• Can incorporate some preconditioners
  – Hierarchical, semiseparable matrices …
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
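The matrix powers kernel behind these bounds can be sketched in 1-D for a tridiagonal A (a simplifying assumption; the stencil and function names are illustrative): each "processor" fetches k ghost entries per side once, then computes its pieces of x, Ax, …, A^k x with no further messages.

```python
# Matrix powers kernel sketch: one ghost-zone fetch replaces k messages.
import numpy as np

def apply_tridiag(x):
    """y = A x for the assumed stencil y_i = x_{i-1} + 2*x_i + x_{i+1}."""
    y = 2 * x
    y[1:] += x[:-1]
    y[:-1] += x[1:]
    return y

def powers_with_ghosts(x, k, lo, hi, n):
    """Entries lo:hi of A^j x for j = 0..k from one widened fetch of x."""
    glo, ghi = max(lo - k, 0), min(hi + k, n)   # one ghost-zone message
    local = x[glo:ghi].copy()
    out = [local[lo - glo:hi - glo].copy()]
    for _ in range(k):
        local = apply_tridiag(local)  # valid interior shrinks by 1 per step
        out.append(local[lo - glo:hi - glo].copy())
    return out

n, k = 40, 4
x = np.arange(n, dtype=float)
pieces = powers_with_ghosts(x, k, lo=10, hi=20, n=n)  # one "processor's" rows
ref = x.copy()
for j in range(k + 1):               # matches a straightforward global sweep
    assert np.allclose(pieces[j], ref[10:20])
    ref = apply_tridiag(ref)
```

The extra k entries computed near each boundary are the "redundant computation" the slide prices in, proportional to the surface-to-volume ratio.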
Tuning Collectives
Rajesh Nishtala
• Collectives called by all threads to perform global communication
  – Broadcast, reduction, all-to-all
• Need to tune because best algorithm depends on
  – #procs
  – Network latency
  – Network bandwidth
  – Transfer size, etc.
• Focus on PGAS languages and one-sided communication models
  – E.g. UPC, Titanium, Co-Array Fortran
Example Tree Topologies
• Radix-4 k-nomial tree (quadnomial)
• Radix-2 k-nomial tree (binomial)
• Binary tree
• Fork tree
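A k-nomial tree of the kind listed above can be generated with a few lines of code. This is a generic textbook construction (not Nishtala's implementation), rooted at rank 0:

```python
# Radix-r k-nomial tree sketch: at stride s = radix^t, every rank divisible
# by s*radix sends to rank + j*s for j = 1..radix-1. radix=2 gives the
# binomial tree, radix=4 the quadnomial tree.
def knomial_children(rank, P, radix):
    """Children of `rank` in a radix-`radix` k-nomial tree over P ranks."""
    children = []
    stride = 1
    while stride < P:
        if rank % (stride * radix) != 0:
            break                       # this rank is a leaf at this level
        for j in range(1, radix):
            c = rank + j * stride
            if c < P:
                children.append(c)
        stride *= radix
    return children

# Every non-root rank appears exactly once as a child:
tree = {r: knomial_children(r, 16, 4) for r in range(16)}
kids = sorted(c for v in tree.values() for c in v)
assert kids == list(range(1, 16))
assert knomial_children(0, 8, 2) == [1, 2, 4]   # binomial root fan-out
```

The tuner's job is then to pick the radix (tree depth vs fan-out) that matches the platform's injection cost.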
Distributed Memory Broadcast
• 1000 16-byte broadcasts
• Best implementation is platform-specific
• Low cost to inject message ⇒ want flat tree (6-ary)
• High cost ⇒ want deep tree to parallelize the overhead (binomial)
[Plot: broadcast time across platforms; lower is better]
Distributed Memory All-to-All
• Optimal algorithm varies based on transfer size
  – O(P log P) algorithm doubles message size after every round
  – Expensive as the processor count grows
• Cross-over point is platform-specific
[Plot: all-to-all time vs transfer size; lower is better]
Collective Synchronization
• UPC presents new semantics for collective synchronization
• Loose synchronization lets collective begin after any thread arrives (looser than MPI semantics)
• Strict synchronization requires all threads to arrive before communication can start (stricter than MPI)
• Synchronization semantics affects algorithm choice
Multicore Barrier Performance Results
• Time many back-to-back barriers on dual-socket Sun Niagara 2 (128 hardware threads)
• Flat tree is just one level with all threads reporting to thread 0
  – Leverages shared memory but non-scalable
• Architecture-independent tree (radix = 2)
  – Pick a generic "good" radix that is suitable for many platforms
  – Mismatched to architecture
• Architecture-dependent tree
  – Search over all radices to pick the tree that best matches the architecture
  – Search revealed radix 8 to be the best
[Plot: Niagara 2 (Maramba) barrier performance — execution time (µs) vs thread count (2 to 128) for flat tree, architecture-independent (radix = 2), and architecture-dependent (radix = best) trees; lower is better]
Summary and Conclusions (1/2)
• Possible to minimize communication complexity of much dense and sparse linear algebra
  – Practical speedups
  – Approaching theoretical lower bounds
• Hardware trends mean the time has come to do this
• Optimal asymptotic complexity algorithms for dense linear algebra – also lower communication
• Important to optimize communication collectives themselves
Summary and Conclusions (2/2)
• Many open problems
  – How to add these to automatic tuning design space
    • PLASMA (Jakub Kurzak)
  – Extend optimality proofs, algorithms to more general architectures
    • Heterogeneity, including multiple bandwidths and latencies
  – Dense eigenvalue problems – SBR or spectral D&C?
  – Sparse direct solvers – CALU or SuperLU?
  – Which preconditioners work?
  – Other motifs
Answers to Questions from Organizers (1/5)
• What about tuning on petascale?
  – Communication-avoiding algorithms help address it
  – Can leverage multicore tuning, if we can mirror hierarchy of machines in hierarchy of algorithms/tuners
• How do we measure success for tuning?
  – Performance gains attained
  – Ease of use by end-user
  – Ease of development/evolution/maintenance of tuners
• What architectures should we target?
  – Multicore and petascale as above
  – Emerging architectures like GPUs
Answers to Questions from Organizers (2/5)
• T/F: "Parameter tuning" is not enough
  – Many of us have been changing data structures and algorithms for a while, which goes beyond "parameters"
• T/F: Self-tuned libraries will beat compilers
  – What is the starting point for the compiler? The most naïve possible code? Then yes (but may work for some motifs)
  – But autotuners use compilers to compile tuned source code
• Will compilers change data structures/algorithms?
  – Would they still be called compilers or something else?
  – How much wisdom about algorithms, numerical analysis and applied math can we expect compilers to incorporate?
Answers to Questions from Organizers (3/5)
• Simple performance models will make search unnecessary, e.g. "cache-oblivious"
  – If search is easy compared to figuring out a "simple" performance model, it is more productive
  – "Cache oblivious" would not have found new communication-avoiding algorithms
  – Simple models help decide when to stop tuning
• Boundaries between SW layers are too rigid
  – Yes! See ParLab structure for one take on breaking the usual layering of the HW/SW stack, e.g. schedulers
Answers to Questions from Organizers (4/5)
• What issues/technologies are we ignoring?
  – Need more compiler-based expertise to build autotuners, so they are easier to design/develop/maintain, and not just big Perl scripts that need to be written from scratch for each new architecture or slightly different function
• What common tools should we build?
  – See answer to last question; see Pluto, talk by Chen
• Tuning systems are too narrow, invest in compilers instead
  – If compilers do not change data structures or algorithms, they will miss a lot of performance gains
  – ParLab hypothesis is that 13 motifs/tuners is enough
  – Invest in compiler tools to help build autotuners
Answers to Questions from Organizers (5/5)
• Runtime optimization is needed
  – Yes! A number of autotuners do this now, e.g. JIT, OSKI
• If all layers of the SW stack are autotuned, how will they be composed?
  – A big question for ParLab too! One answer for multicore is having schedulers for each library that can accept or give up resources (cores) on the fly so that higher-level schedulers can balance work at a higher level
EXTRA SLIDES
Locally Dependent Entries for [x, Ax], A tridiagonal, 2 processors
[Plot: levels x, Ax, …, A^8x vs column index, Proc 1 / Proc 2]
Can be computed without communication
Locally Dependent Entries for [x, Ax, A^2x], A tridiagonal, 2 processors
[Plot: levels x, Ax, …, A^8x vs column index, Proc 1 / Proc 2]
Can be computed without communication
Locally Dependent Entries for [x, Ax, …, A^3x], A tridiagonal, 2 processors
[Plot: levels x, Ax, …, A^8x vs column index, Proc 1 / Proc 2]
Can be computed without communication
Locally Dependent Entries for [x, Ax, …, A^4x], A tridiagonal, 2 processors
[Plot: levels x, Ax, …, A^8x vs column index, Proc 1 / Proc 2]
Can be computed without communication
Locally Dependent Entries for [x, Ax, …, A^8x], A tridiagonal, 2 processors
[Plot: levels x, Ax, …, A^8x vs column index, Proc 1 / Proc 2]
Can be computed without communication; k = 8-fold reuse of A
Remotely Dependent Entries for [x, Ax, …, A^8x], A tridiagonal, 2 processors
[Plot: Type (1) remote dependencies for k = 8]
One message to get data needed to compute remotely dependent entries, not k = 8. Minimizes number of messages = latency cost. Price: redundant work ∝ "surface/volume ratio"
Fewer Remotely Dependent Entries for [x, Ax, …, A^8x], A tridiagonal, 2 processors
[Plot: Type (2) remote dependencies for k = 8]
Reduce redundant work by half
Remotely Dependent Entries for [x, Ax, A^2x, A^3x], A irregular, multiple processors
Sequential [x, Ax, …, A^4x], with memory hierarchy
One read of matrix from slow memory, not k = 4. Minimizes words moved = bandwidth cost.
No redundant work
Performance Results
• Measured
  – Sequential/OOC speedup up to 3x
• Modeled
  – Sequential/multicore speedup up to 2.5x
  – Parallel/Petascale speedup up to 6.9x
  – Parallel/Grid speedup up to 22x
• See bebop.cs.berkeley.edu/#pubs
Optimizing Communication Complexity of Sparse Solvers
• Example: GMRES for Ax = b on "2D mesh"
  – x lives on n-by-n mesh
  – Partitioned on p^(1/2)-by-p^(1/2) grid
  – A has "5 point stencil" (Laplacian)
    • (Ax)(i,j) = linear_combination(x(i,j), x(i,j±1), x(i±1,j))
  – Ex: 18-by-18 mesh on 3-by-3 grid
Minimizing Communication of GMRES
• What is the cost = (#flops, #words, #mess) of k steps of standard GMRES?

GMRES, ver. 1:
  for i = 1 to k
    w = A * v(i-1)
    MGS(w, v(0), …, v(i-1))
    update v(i), H
  endfor
  solve LSQ problem with H

[Diagram: n/p^(1/2) x n/p^(1/2) local mesh block per processor]
• Cost(A*v) = k * (9n^2/p, 4n/p^(1/2), 4)
• Cost(MGS) = k^2/2 * (4n^2/p, log p, log p)
• Total cost ~ Cost(A*v) + Cost(MGS)
• Can we reduce the latency?
Minimizing Communication of GMRES
• Cost(GMRES, ver. 1) = Cost(A*v) + Cost(MGS)
    = (9kn^2/p, 4kn/p^(1/2), 4k) + (2k^2n^2/p, k^2 log p / 2, k^2 log p / 2)
• How much latency cost from A*v can you avoid? Almost all

GMRES, ver. 2:
  W = [v, Av, A^2v, …, A^kv]
  [Q, R] = MGS(W)
  Build H from R, solve LSQ problem

[Diagram: matrix powers ghost regions for k = 3]
• Cost(W) = (~same, ~same, 8)
• Latency cost independent of k – optimal
• Cost(MGS) unchanged
• Can we reduce the latency more?
Minimizing Communication of GMRES
• Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
    = (9kn^2/p, 4kn/p^(1/2), 8) + (2k^2n^2/p, k^2 log p / 2, k^2 log p / 2)
• How much latency cost from MGS can you avoid? Almost all

GMRES, ver. 3:
  W = [v, Av, A^2v, …, A^kv]
  [Q, R] = TSQR(W)   … "Tall Skinny QR"
  Build H from R, solve LSQ problem

[Diagram: W = [W1; W2; W3; W4] → R1, R2, R3, R4 → R12, R34 → R1234]
• Cost(TSQR) = (~same, ~same, log p)
• Latency cost independent of k – optimal
Minimizing Communication of GMRES
[Same slide, next build]
• Cost(TSQR) = (~same, ~same, log p)
• Oops
Minimizing Communication of GMRES
[Same slide, next build]
• Cost(TSQR) = (~same, ~same, log p)
• Oops – W from power method, precision lost!
Minimizing Communication of GMRES
• Cost(GMRES, ver. 3) = Cost(W) + Cost(TSQR)
    = (9kn^2/p, 4kn/p^(1/2), 8) + (2k^2n^2/p, k^2 log p / 2, log p)
• Latency cost independent of k, just log p – optimal
• Oops – W from power method, so precision lost – what to do?
• Use a different polynomial basis
  – Not monomial basis W = [v, Av, A^2v, …]; instead …
  – Newton basis: W_N = [v, (A – θ_1 I)v, (A – θ_2 I)(A – θ_1 I)v, …] or
  – Chebyshev basis: W_C = [v, T_1(v), T_2(v), …]
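The basis problem can be illustrated numerically. This is a hedged toy comparison, not the talk's code: the monomial basis behaves like the power method (columns collapse toward the dominant eigenvector), while the Newton basis with shifts θ_j stays much better conditioned. The evenly spaced shifts below are assumed sample values; in practice they would come from Ritz/Leja points.

```python
# Monomial vs Newton basis conditioning for W = [v, p_1(A)v, ..., p_k(A)v].
import numpy as np

def monomial_basis(A, v, k):
    W = [v]
    for _ in range(k):
        W.append(A @ W[-1])                       # power method iterates
    return np.column_stack(W)

def newton_basis(A, v, k, shifts):
    W = [v]
    for j in range(k):
        W.append(A @ W[-1] - shifts[j] * W[-1])   # (A - theta_j I) w
    return np.column_stack(W)

A = np.diag(np.linspace(1.0, 2.0, 100))           # toy matrix, known spectrum
v = np.ones(100)
k = 8
Wm = monomial_basis(A, v, k)
Wn = newton_basis(A, v, k, np.linspace(1.0, 2.0, k))
assert np.linalg.cond(Wn) < np.linalg.cond(Wm)    # Newton far better here
```

TSQR then orthogonalizes the better-conditioned W without reintroducing the latency cost.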
Related Work and Contributions for Sparse Linear Algebra
• Related work
  – s-step GMRES: De Sturler (1991), Bai, Hu & Reichel (1991), Joubert et al (1992), Erhel (1995)
  – s-step CG: Van Rosendale (1983), Chronopoulos & Gear (1989), Toledo (1995)
  – Matrix Powers Kernel: Pfeifer (1963), Hong & Kung (1981), Leiserson, Rao & Toledo (1993), Toledo (1995), Douglas, Hu, Kowarschik, Rüde, Weiss (2000), Strout, Carter & Ferrante (2001)
• Our contributions
  – Unified approach to serial and parallel, use of TSQR
  – Optimizing serial case via TSP
  – Unified approach to stability
  – Incorporating preconditioning
Summary and Conclusions (1/2)
• Possible to minimize communication complexity of much dense and sparse linear algebra
  – Practical speedups
  – Approaching theoretical lower bounds
• Optimal asymptotic complexity algorithms for dense linear algebra – also lower communication
• Hardware trends mean the time has come to do this
• Lots of prior work (see pubs) – and some new
Summary and Conclusions (2/2)
• Many open problems
  – Automatic tuning – build and optimize complicated data structures, communication patterns, code automatically: bebop.cs.berkeley.edu
  – Extend optimality proofs to general architectures
  – Dense eigenvalue problems – SBR or spectral D&C?
  – Sparse direct solvers – CALU or SuperLU?
  – Which preconditioners work?
  – Why stop at linear algebra?