recent progress in autotuning at berkeley
TRANSCRIPT
Recent Progress in Autotuning at Berkeley
Jim Demmel, UC Berkeley
bebop.cs.berkeley.edu / parlab.eecs.berkeley.edu
Outline
• Introduction to ParLab
• Summary of autotuning activities
  – Also Shoaib Kamil's talk on Thursday @ 10:15
• Dense Linear Algebra
• Sparse Linear Algebra (summary)
• Communication collectives
• Responses to questions from the organizers
Overview of the Parallel Laboratory
Krste Asanovic, Ras Bodik, Jim Demmel, Tom Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, Kathy Yelick
parlab.eecs.berkeley.edu
How do compelling apps relate to 13 motifs?
[Chart: "Motif" popularity across application areas, shaded red (hot) to blue (cold)]
ParLab Research Overview: Easy to write correct programs that run efficiently on manycore
[Diagram of the ParLab stack:
 Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser; Motifs
 Productivity layer: Composition & Coordination Language (C&CL), C&CL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, Sketching, Static Verification
 Efficiency layer: Efficiency Languages, Efficiency Language Compilers, Type Systems, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives, Dynamic Checking, Directed Testing, Debugging with Replay
 OS/Architecture: Legacy OS, OS Libraries & Services, Hypervisor, Multicore/GPGPU, RAMP Manycore
 Cross-cutting concerns: Correctness; Diagnosing Power/Performance]
Summary of Autotuning (1/2)
• Dense linear algebra
  – New algorithms that attain lower bounds on communication (parallel and sequential)
  – New algorithms for GPUs
• Sparse linear algebra (summary)
  – New algorithms that attain lower bounds on communication
• Collective communications
  – Choosing right tree for reductions/broadcasts/etc.
Summary of Autotuning (2/2)
• To be presented by Shoaib Kamil on Thursday
• Recent work autotuning three parallel kernels
  – Sparse Matrix-Vector Multiply (SpMV)
  – Lattice Boltzmann MHD
  – Stencils (Heat Equation)
• Lessons learned / commonalities between the autotuners
• Towards a framework for building autotuners
  – What is the role of the compiler?
Motivation
• Running time of an algorithm is the sum of 3 terms:
  – #flops * time_per_flop
  – #words moved / bandwidth
  – #messages * latency
  – (the last two terms are the cost of communication)
• Exponentially growing gaps between:
  – time_per_flop << 1/network bandwidth << network latency
    • Improving 59%/year vs 26%/year vs 15%/year
  – time_per_flop << 1/memory bandwidth << memory latency
    • Improving 59%/year vs 23%/year vs 5.5%/year
• Goal: reorganize/tune motifs to avoid communication
  – Not just hiding communication (speedup ≤ 2x)
  – Arbitrary speedups possible for linear algebra
  – Success metric: attain lower bounds on communication
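The three-term cost model above can be sketched directly in code. The machine numbers below are illustrative assumptions, not measurements from the talk:

```python
# Sketch of the slide's runtime model:
# T = #flops * time_per_flop + #words moved / bandwidth + #messages * latency.

def modeled_runtime(flops, words, messages,
                    time_per_flop=1e-9,   # 1 Gflop/s per core (assumed)
                    bandwidth=4e9,        # 4e9 words/s (assumed)
                    latency=1e-6):        # 1 microsecond per message (assumed)
    """Sum of the compute, bandwidth, and latency terms."""
    return flops * time_per_flop + words / bandwidth + messages * latency

# Latency dominates when the same data moves in many small messages:
t_many = modeled_runtime(flops=1e9, words=1e8, messages=100_000)
t_few = modeled_runtime(flops=1e9, words=1e8, messages=10)
assert t_few < t_many
```

This is why the new algorithms target the #messages and #words terms, not just #flops.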
Linear Algebra Collaborators (so far)
• UC Berkeley
  – Kathy Yelick, Ming Gu
  – Mark Hoemmen, Marghoob Mohiyuddin, Kaushik Datta, George Petropoulos, Sam Williams, BeBOP group
  – Lenny Oliker, John Shalf
  – ParLab/UPCRC – new parallel computing research center funded by Intel and Microsoft
• CU Denver
  – Julien Langou
• INRIA
  – Laura Grigori, Hua Xiang
• Georgia Tech
  – Rich Vuduc
• Dense work ⊆ ongoing development of Sca/LAPACK
  – Joint with UTK
Communication Lower Bounds for Dense Linear Algebra (1/2)
• Hong-Kung (1981), Irony/Tiskin/Toledo (2004)
• Usual matmul using 2n^3 flops
• Sequential case, with fast memory of size W < 3n^2
  – Thm: #words moved = Ω(n^3 / W^(1/2))
• Parallel case, P procs, O(n^2/P) memory/proc
  – Thm: #words moved = Ω(n^2 / P^(1/2))
• Bandwidth lower bound ⇒ latency lower bound
  – #messages ≥ #words moved / (fast, local) memory size
  – Sequential: #messages = Ω(n^3 / W^(3/2))
  – Parallel: #messages = Ω(P^(1/2))
Communication Lower Bounds for Dense Linear Algebra (2/2)
• Same lower bounds apply to LU and QR
  – Assumption: O(n^3) algorithms; LU is easy but QR is subtle
• LAPACK and ScaLAPACK do not attain these bounds
  – ScaLAPACK attains the bandwidth bound
  – But sends O((mn/P)^(1/2)) times more messages
  – LAPACK attains neither; O((m^2/W)^(1/2))x more words moved
• But new algorithms do attain them, mod polylog factors
  – Parallel QR: requires new representation of Q, redundant flops
  – Sequential QR: can use new one, or recursive
  – Parallel LU: new pivoting strategy needed, stable in practice
  – Sequential LU: can use new one or recursive, but higher latency
Minimizing Communication in Parallel QR
[Diagram: W = [W0; W1; W2; W3]; local factors R00, R10, R20, R30 combined pairwise into R01, R11, then into R02]
• QR decomposition of m x n matrix W, m >> n
• TSQR = "Tall Skinny QR"
• P processors, block row layout
• Usual parallel algorithm
  – Compute Householder vector for each column
  – Number of messages ∝ n log P
• Communication-avoiding algorithm
  – Reduction operation, with QR as operator
  – Number of messages ∝ log P – optimal
TSQR in more detail
Q is represented implicitly as a product (tree of factors)
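A minimal NumPy sketch of the TSQR reduction, assuming a binary tree (the talk's implementations use MPI and tuned BLAS; this only illustrates the R-factor reduction, not the implicit tree of Q factors):

```python
# TSQR sketch: each "processor" QR-factors its block row, then pairs of
# R factors are stacked and re-factored up a binary reduction tree.
import numpy as np

def tsqr_R(blocks):
    """Return the R factor of the stacked blocks via a binary reduction tree."""
    # Leaf step: local QR on each block row W_i -> R_i
    Rs = [np.linalg.qr(W, mode='r') for W in blocks]
    # Reduction: combine pairs [R_i; R_j] -> new R until one remains
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

rng = np.random.default_rng(0)
W = rng.standard_normal((4000, 50))   # tall skinny matrix, m >> n
blocks = np.vsplit(W, 4)              # block row layout, 4 "processors"
R_tree = tsqr_R(blocks)
R_ref = np.linalg.qr(W, mode='r')
# R is unique up to row signs, so compare magnitudes
assert np.allclose(np.abs(R_tree), np.abs(R_ref))
```

Each reduction round is one message per pair of processors, giving the log P message count on the slide.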
Minimizing Communication in TSQR
[Diagrams of reduction trees over W = [W0; W1; W2; W3]:
 Parallel: binary tree — R00, R10, R20, R30 → R01, R11 → R02
 Sequential: linear chain — R00 → R01 → R02 → R03
 Dual core: hybrid of the two]
Choose reduction tree dynamically
Multicore / multisocket / multirack / multisite / out-of-core: ?
Performance of TSQR vs Sca/LAPACK
• Parallel
  – Pentium III cluster, Dolphin interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Both use Elmroth-Gustavson locally – enabled by TSQR
• Sequential
  – OOC on PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
• See UC Berkeley EECS Tech Report 2008-74
  – Being revised to add optimality results!
• General rectangular QR
  – TSQR for panel factorization, then right-looking
  – Implementation underway (Julien Langou)
Modeled Speedups of CAQR vs ScaLAPACK
• Petascale: up to 22.9x
• IBM Power 5: up to 9.7x
• "Grid": up to 11x
Petascale machine with 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s.
TSLU: LU factorization of a tall skinny matrix
First try the obvious generalization of TSQR:
Growth factor for TSLU-based factorization
[Plot: growth factor vs matrix size]
Unstable for large P and large matrices. When P = #rows, TSLU is equivalent to parallel pivoting.
Courtesy of H. Xiang
Making TSLU Stable
• At each node in tree, TSLU selects b pivot rows from 2b candidates from its 2 child nodes
• At each node, do LU on 2b original rows selected by child nodes, not U factors from child nodes
• When TSLU done, permute b selected rows to top of original matrix, redo b steps of LU without pivoting
• CALU – Communication-Avoiding LU for general A
  – Use TSLU for panel factorizations
  – Apply to rest of matrix
  – Cost: redundant panel factorizations
• Benefit:
  – Stable in practice, but not same pivot choice as GEPP
  – b times fewer messages overall – faster
Growth factor for better CALU approach
Like threshold pivoting with worst-case threshold = 0.33, so |L| <= 3. Testing shows about the same residual as GEPP.
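The tournament-pivoting idea behind TSLU can be sketched in NumPy. Everything here (function names, block counts) is illustrative, not the talk's implementation; the key point is that tree nodes re-run GEPP on the candidate *original* rows from their children:

```python
# TSLU tournament pivoting sketch: leaves pick b candidate pivot rows with
# ordinary GEPP on their block; internal nodes re-run GEPP on the 2b
# original rows chosen by their children. Assumes nblocks is a power of 2.
import numpy as np

def gepp_pivot_rows(block, rows, b):
    """Global indices of the first b pivot rows GEPP picks on `block`."""
    M = block.astype(float).copy()
    rows = list(rows)
    for k in range(min(b, M.shape[0])):
        p = k + int(np.argmax(np.abs(M[k:, k])))       # partial pivoting
        M[[k, p]] = M[[p, k]]
        rows[k], rows[p] = rows[p], rows[k]
        M[k+1:, k:] -= np.outer(M[k+1:, k] / M[k, k], M[k, k:])  # eliminate
    return rows[:b]

def tslu_pivot_rows(A, nblocks):
    """Tournament over a binary tree; b = panel width (#columns of A)."""
    b = A.shape[1]
    idx = np.split(np.arange(A.shape[0]), nblocks)
    cand = [gepp_pivot_rows(blk, ix, b)
            for blk, ix in zip(np.vsplit(A, nblocks), idx)]
    while len(cand) > 1:
        cand = [gepp_pivot_rows(A[cand[i] + cand[i+1]], cand[i] + cand[i+1], b)
                for i in range(0, len(cand), 2)]
    return cand[0]

rng = np.random.default_rng(1)
piv = tslu_pivot_rows(rng.standard_normal((256, 8)), nblocks=4)
assert len(set(piv)) == 8        # b distinct pivot rows selected
```

Each tree level exchanges only b candidate rows per pair of processors, which is where the "b times fewer messages" claim comes from.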
Performance vs ScaLAPACK
• TSLU
  – IBM Power 5
    • Up to 4.37x faster (16 procs, 1M x 150)
  – Cray XT4
    • Up to 5.52x faster (8 procs, 1M x 150)
• CALU
  – IBM Power 5
    • Up to 2.29x faster (64 procs, 1000 x 1000)
  – Cray XT4
    • Up to 1.81x faster (64 procs, 1000 x 1000)
• Optimality analysis analogous to QR
• See INRIA Tech Report 6523 (2008)
Speedup prediction for a Petascale machine – up to 81x faster, P = 8192
Petascale machine with 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s.
Related Work and Contributions for Dense Linear Algebra
• Related work
  – Pothen & Raghavan (1989)
  – Toledo (1997), Rabani & Toledo (2001)
  – Dunha, Becker & Patterson (2002)
  – Gunter & van de Geijn (2005)
• Our contributions
  – QR: 2D parallel, efficient 1D parallel QR
  – Unified approach to sequential, parallel, …
  – Parallel LU
  – Communication lower bounds, proving optimality
Heterogeneous (CPU + GPU) Performance Libraries
Vasily Volkov
Challenges in programming GPUs
• Relatively little tuning information for GPU hardware
  – Hardware exposed via abstract programming model
  – Hard to develop accurate performance models
  – Hardware changing rapidly
• We get ~2x speedups vs. best other work on a variety of libraries
  – Matrix-matrix multiply (1.6x speedup)
  – FFT (2x speedup)
  – LU/Cholesky factorization (3x/2x speedups)
  – Tridiagonal eigenvalue solver (2x speedup)
• Example: partial pivoting in LU factorization on NVIDIA GPU
  – Ad: "GPU supports gather/scatter with arbitrary access patterns"
  – LAPACK code (sgetrf) spends 50% of time doing pivoting
  – Gets worse with larger matrices
  – Only 1% of time is spent in pivoting if matrix is transposed
    • 2x speedup!
Matrix-matrix multiply (SGEMM)
• Uses software prefetching, strip-mining, register blocking
• Resembles algorithms used for Cray X1 and IBM 3090 VF
• 60% of peak, bound by access to on-chip shared memory
FFT, complex-to-complex, batched
• Results on NVIDIA GeForce 8800 GTX
• Resembles algorithms used for vector processors
• Radix-8 in registers, transposes using shared memory
[Plot: Gflop/s vs N (1 to 32768) for bench FFT, register storage, CUFFT 1.1, and our best implementation]
Heterogeneous Computing
• Question: should you port entire applications to GPU? Maybe not
• Example: eigenvalue solver (bisection, LAPACK's sstebz)
  – Most work (O(n^2)) in vectorized, embarrassingly parallel code
    • Do it on GPU and finely tune
    • May do many redundant flops on GPU to go faster (multisection)
    • Run on CPU when problem too small for GPU
    • Use performance model to choose CPU or GPU version
  – Little work (O(n)) in nearly serial code
    • Do it on CPU: time-consuming and non-trivial to port
• Independent project:
  – Did everything on the GPU
  – Used advanced parallel algorithms for O(n) work
  – Up to 2x slower running on faster GPU (compared to us)
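The "use a performance model to choose CPU or GPU" bullet can be sketched as a toy dispatcher. The rates and overheads below are made-up placeholders; a real tuner would measure them on the target machine:

```python
# Model-based CPU/GPU dispatch sketch: the GPU has a fixed startup cost but
# a much higher flop rate, so small problems belong on the CPU.
def choose_device(n, gpu_startup=1e-4, gpu_rate=1e11, cpu_rate=1e9):
    """Pick the device with the lower modeled time for O(n^2) work."""
    t_gpu = gpu_startup + n * n / gpu_rate   # launch overhead + fast flops
    t_cpu = n * n / cpu_rate                 # no startup, slower flops
    return 'gpu' if t_gpu < t_cpu else 'cpu'

assert choose_device(100) == 'cpu'       # too small: startup dominates
assert choose_device(10_000) == 'gpu'    # large: GPU flop rate wins
```

The crossover point is exactly the kind of platform-specific parameter an autotuner searches for.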
Breakdown of Runtime (sstebz)
• "Count(x)" is the O(n^2) work; use either SSE or GPU
• "The rest" is the O(n) work; isn't worth optimizing
LU Factorization
• O(n^3) work in bulky BLAS3 operations
• O(n^2) work in fine-grain BLAS1/BLAS2 (panel factorization)
• Panel factorization is usually faster on the CPU
  – Peak sustained bandwidth of 8800 GTX is 75 GB/s
  – Kernel launch overhead is 5 µs
  – BLAS1 and BLAS2 are bandwidth bound
  – Compare vs. CPU handicapped by CPU-GPU transfers
  – Faster on CPU up to n ~ 16K on NVIDIA 8800 GTX
Overall Performance
Overlap in LU Factorization
Overlap panel factorization on CPU with sgemm on GPU
LU slowdowns from omitting optimizations
Summary of GPU Experience
• Heterogeneity expands tuning space
  – Dividing work between CPU and GPU
    • Depends a lot on flop-rate/bandwidth/latency/startup
  – Different data layouts may be best for CPU and GPU
    • Rowwise vs columnwise for LU
  – May want to do (many) extra flops on GPU
    • Multisection vs bisection in eigensolver
    • TRINV + TRMM vs TRSM in LU
• Rapid evolution of GPU architectures motivates autotuning
Why all our problems are solved for dense linear algebra – in theory
• Thm (D., Dumitriu, Holtz, Kleinberg) (Numer. Math. 2007)
  – Given any matmul running in O(n^ω) ops for some ω > 2, it can be made stable and still run in O(n^(ω+ε)) ops, for any ε > 0.
• Current record: ω ≈ 2.38
• Thm (D., Dumitriu, Holtz) (Numer. Math. 2008)
  – Given any stable matmul running in O(n^(ω+ε)) ops, can do backward stable dense linear algebra in O(n^(ω+ε)) ops:
    • GEPP, QR (recursive)
    • rank-revealing QR (randomized)
    • (Generalized) Schur decomposition, SVD (randomized)
• Also reduces communication to O(n^(ω+ε))
• But constants?
Avoiding Communication in Sparse Linear Algebra – Summary
• Take k steps of Krylov subspace method
  – GMRES, CG, Lanczos, Arnoldi
  – Assume matrix "well-partitioned," with modest surface-to-volume ratio
  – Parallel implementation
    • Conventional: O(k log p) messages
    • "New": O(log p) messages – optimal
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • "New": O(1) moves of data – optimal
• Can incorporate some preconditioners
  – Hierarchical, semiseparable matrices …
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
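The matrix powers kernel behind these bounds can be sketched in 1-D for a tridiagonal A (a simplifying assumption; the stencil and function names are illustrative): each "processor" fetches k ghost entries per side once, then computes its pieces of x, Ax, …, A^k x with no further messages.

```python
# Matrix powers kernel sketch: one ghost-zone fetch replaces k messages.
import numpy as np

def apply_tridiag(x):
    """y = A x for the assumed stencil y_i = x_{i-1} + 2*x_i + x_{i+1}."""
    y = 2 * x
    y[1:] += x[:-1]
    y[:-1] += x[1:]
    return y

def powers_with_ghosts(x, k, lo, hi, n):
    """Entries lo:hi of A^j x for j = 0..k from one widened fetch of x."""
    glo, ghi = max(lo - k, 0), min(hi + k, n)   # one ghost-zone message
    local = x[glo:ghi].copy()
    out = [local[lo - glo:hi - glo].copy()]
    for _ in range(k):
        local = apply_tridiag(local)  # valid interior shrinks by 1 per step
        out.append(local[lo - glo:hi - glo].copy())
    return out

n, k = 40, 4
x = np.arange(n, dtype=float)
pieces = powers_with_ghosts(x, k, lo=10, hi=20, n=n)  # one "processor's" rows
ref = x.copy()
for j in range(k + 1):               # matches a straightforward global sweep
    assert np.allclose(pieces[j], ref[10:20])
    ref = apply_tridiag(ref)
```

The extra k entries computed near each boundary are the "redundant computation" the slide prices in, proportional to the surface-to-volume ratio.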
Tuning Collectives
Rajesh Nishtala
• Collectives called by all threads to perform global communication
  – Broadcast, reduction, all-to-all
• Need to tune because best algorithm depends on
  – #procs
  – Network latency
  – Network bandwidth
  – Transfer size, etc.
• Focus on PGAS languages and one-sided communication models
  – E.g. UPC, Titanium, Co-Array Fortran
Example Tree Topologies
• Radix-4 k-nomial tree (quadnomial)
• Radix-2 k-nomial tree (binomial)
• Binary tree
• Fork tree
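A k-nomial tree of the kind listed above can be generated with a few lines of code. This is a generic textbook construction (not Nishtala's implementation), rooted at rank 0:

```python
# Radix-r k-nomial tree sketch: at stride s = radix^t, every rank divisible
# by s*radix sends to rank + j*s for j = 1..radix-1. radix=2 gives the
# binomial tree, radix=4 the quadnomial tree.
def knomial_children(rank, P, radix):
    """Children of `rank` in a radix-`radix` k-nomial tree over P ranks."""
    children = []
    stride = 1
    while stride < P:
        if rank % (stride * radix) != 0:
            break                       # this rank is a leaf at this level
        for j in range(1, radix):
            c = rank + j * stride
            if c < P:
                children.append(c)
        stride *= radix
    return children

# Every non-root rank appears exactly once as a child:
tree = {r: knomial_children(r, 16, 4) for r in range(16)}
kids = sorted(c for v in tree.values() for c in v)
assert kids == list(range(1, 16))
assert knomial_children(0, 8, 2) == [1, 2, 4]   # binomial root fan-out
```

The tuner's job is then to pick the radix (tree depth vs fan-out) that matches the platform's injection cost.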
Distributed Memory Broadcast
• 1000 16-byte broadcasts
• Best implementation is platform-specific
• Low cost to inject message ⇒ want flat tree (6-ary)
• High cost ⇒ want deep tree to parallelize the overhead (binomial)
[Plot: broadcast time across platforms; lower is better]
Distributed Memory All-to-All
• Optimal algorithm varies based on transfer size
  – O(P log P) algorithm doubles message size after every round
  – Expensive as the processor count grows
• Cross-over point is platform-specific
[Plot: all-to-all time vs transfer size; lower is better]
Collective Synchronization
• UPC presents new semantics for collective synchronization
• Loose synchronization lets collective begin after any thread arrives (looser than MPI semantics)
• Strict synchronization requires all threads to arrive before communication can start (stricter than MPI)
• Synchronization semantics affects algorithm choice
Multicore Barrier Performance Results
• Time many back-to-back barriers on dual-socket Sun Niagara 2 (128 hardware threads)
• Flat tree is just one level with all threads reporting to thread 0
  – Leverages shared memory but non-scalable
• Architecture-independent tree (radix = 2)
  – Pick a generic "good" radix that is suitable for many platforms
  – Mismatched to architecture
• Architecture-dependent tree
  – Search over all radices to pick the tree that best matches the architecture
  – Search revealed radix 8 to be the best
[Plot: Niagara 2 (Maramba) barrier performance — execution time (µs) vs thread count (2 to 128) for flat tree, architecture-independent (radix = 2), and architecture-dependent (radix = best) trees; lower is better]
Summary and Conclusions (1/2)
• Possible to minimize communication complexity of much dense and sparse linear algebra
  – Practical speedups
  – Approaching theoretical lower bounds
• Hardware trends mean the time has come to do this
• Optimal asymptotic complexity algorithms for dense linear algebra – also lower communication
• Important to optimize communication collectives themselves
Summary and Conclusions (2/2)
• Many open problems
  – How to add these to automatic tuning design space
    • PLASMA (Jakub Kurzak)
  – Extend optimality proofs, algorithms to more general architectures
    • Heterogeneity, including multiple bandwidths and latencies
  – Dense eigenvalue problems – SBR or spectral D&C?
  – Sparse direct solvers – CALU or SuperLU?
  – Which preconditioners work?
  – Other motifs
Answers to Questions from Organizers (1/5)
• What about tuning on petascale?
  – Communication-avoiding algorithms help address it
  – Can leverage multicore tuning, if we can mirror hierarchy of machines in hierarchy of algorithms/tuners
• How do we measure success for tuning?
  – Performance gains attained
  – Ease of use by end-user
  – Ease of development/evolution/maintenance of tuners
• What architectures should we target?
  – Multicore and petascale as above
  – Emerging architectures like GPUs
Answers to Questions from Organizers (2/5)
• T/F: "Parameter tuning" is not enough
  – Many of us have been changing data structures and algorithms for a while, which goes beyond "parameters"
• T/F: Self-tuned libraries will beat compilers
  – What is the starting point for the compiler? The most naïve possible code? Then yes (but may work for some motifs)
  – But autotuners use compilers to compile tuned source code
• Will compilers change data structures/algorithms?
  – Would they still be called compilers or something else?
  – How much wisdom about algorithms, numerical analysis and applied math can we expect compilers to incorporate?
Answers to Questions from Organizers (3/5)
• Simple performance models will make search unnecessary, e.g. "cache-oblivious"
  – If search is easy compared to figuring out a "simple" performance model, it is more productive
  – "Cache oblivious" would not have found new communication-avoiding algorithms
  – Simple models help decide when to stop tuning
• Boundaries between SW layers are too rigid
  – Yes! See ParLab structure for one take on breaking the usual layering of the HW/SW stack, e.g. schedulers
Answers to Questions from Organizers (4/5)
• What issues/technologies are we ignoring?
  – Need more compiler-based expertise to build autotuners, so they are easier to design/develop/maintain, and not just big Perl scripts that need to be written from scratch for each new architecture or slightly different function
• What common tools should we build?
  – See answer to last question; see Pluto, talk by Chen
• Tuning systems are too narrow, invest in compilers instead
  – If compilers do not change data structures or algorithms, they will miss a lot of performance gains
  – ParLab hypothesis is that 13 motifs/tuners is enough
  – Invest in compiler tools to help build autotuners
Answers to Questions from Organizers (5/5)
• Runtime optimization is needed
  – Yes! A number of autotuners do this now, e.g. JIT, OSKI
• If all layers of the SW stack are autotuned, how will they be composed?
  – A big question for ParLab too! One answer for multicore is having schedulers for each library that can accept or give up resources (cores) on the fly so that higher-level schedulers can balance work at a higher level
EXTRA SLIDES
Locally Dependent Entries for [x, Ax], A tridiagonal, 2 processors
[Plot: levels x, Ax, …, A^8x vs column index, Proc 1 / Proc 2]
Can be computed without communication
Locally Dependent Entries for [x, Ax, A^2x], A tridiagonal, 2 processors
[Plot: levels x, Ax, …, A^8x vs column index, Proc 1 / Proc 2]
Can be computed without communication
Locally Dependent Entries for [x, Ax, …, A^3x], A tridiagonal, 2 processors
[Plot: levels x, Ax, …, A^8x vs column index, Proc 1 / Proc 2]
Can be computed without communication
Locally Dependent Entries for [x, Ax, …, A^4x], A tridiagonal, 2 processors
[Plot: levels x, Ax, …, A^8x vs column index, Proc 1 / Proc 2]
Can be computed without communication
Locally Dependent Entries for [x, Ax, …, A^8x], A tridiagonal, 2 processors
[Plot: levels x, Ax, …, A^8x vs column index, Proc 1 / Proc 2]
Can be computed without communication; k = 8-fold reuse of A
Remotely Dependent Entries for [x, Ax, …, A^8x], A tridiagonal, 2 processors
[Plot: Type (1) remote dependencies for k = 8]
One message to get data needed to compute remotely dependent entries, not k = 8. Minimizes number of messages = latency cost. Price: redundant work ∝ "surface/volume ratio"
Fewer Remotely Dependent Entries for [x, Ax, …, A^8x], A tridiagonal, 2 processors
[Plot: Type (2) remote dependencies for k = 8]
Reduce redundant work by half
Remotely Dependent Entries for [x, Ax, A^2x, A^3x], A irregular, multiple processors
Sequential [x, Ax, …, A^4x], with memory hierarchy
One read of matrix from slow memory, not k = 4. Minimizes words moved = bandwidth cost.
No redundant work
Performance Results
• Measured
  – Sequential/OOC speedup up to 3x
• Modeled
  – Sequential/multicore speedup up to 2.5x
  – Parallel/Petascale speedup up to 6.9x
  – Parallel/Grid speedup up to 22x
• See bebop.cs.berkeley.edu/#pubs
Optimizing Communication Complexity of Sparse Solvers
• Example: GMRES for Ax = b on "2D mesh"
  – x lives on n-by-n mesh
  – Partitioned on p^(1/2)-by-p^(1/2) grid
  – A has "5 point stencil" (Laplacian)
    • (Ax)(i,j) = linear_combination(x(i,j), x(i,j±1), x(i±1,j))
  – Ex: 18-by-18 mesh on 3-by-3 grid
Minimizing Communication of GMRES
• What is the cost = (#flops, #words, #mess) of k steps of standard GMRES?

GMRES, ver. 1:
  for i = 1 to k
    w = A * v(i-1)
    MGS(w, v(0), …, v(i-1))
    update v(i), H
  endfor
  solve LSQ problem with H

[Diagram: n/p^(1/2) x n/p^(1/2) local mesh block per processor]
• Cost(A*v) = k * (9n^2/p, 4n/p^(1/2), 4)
• Cost(MGS) = k^2/2 * (4n^2/p, log p, log p)
• Total cost ~ Cost(A*v) + Cost(MGS)
• Can we reduce the latency?
Minimizing Communication of GMRES
• Cost(GMRES, ver. 1) = Cost(A*v) + Cost(MGS)
    = (9kn^2/p, 4kn/p^(1/2), 4k) + (2k^2n^2/p, k^2 log p / 2, k^2 log p / 2)
• How much latency cost from A*v can you avoid? Almost all

GMRES, ver. 2:
  W = [v, Av, A^2v, …, A^kv]
  [Q, R] = MGS(W)
  Build H from R, solve LSQ problem

[Diagram: matrix powers ghost regions for k = 3]
• Cost(W) = (~same, ~same, 8)
• Latency cost independent of k – optimal
• Cost(MGS) unchanged
• Can we reduce the latency more?
Minimizing Communication of GMRES
• Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
    = (9kn^2/p, 4kn/p^(1/2), 8) + (2k^2n^2/p, k^2 log p / 2, k^2 log p / 2)
• How much latency cost from MGS can you avoid? Almost all

GMRES, ver. 3:
  W = [v, Av, A^2v, …, A^kv]
  [Q, R] = TSQR(W)   … "Tall Skinny QR"
  Build H from R, solve LSQ problem

[Diagram: W = [W1; W2; W3; W4] → R1, R2, R3, R4 → R12, R34 → R1234]
• Cost(TSQR) = (~same, ~same, log p)
• Latency cost independent of k – optimal
Minimizing Communication of GMRES
[Same slide, next build]
• Cost(TSQR) = (~same, ~same, log p)
• Oops
Minimizing Communication of GMRES
[Same slide, next build]
• Cost(TSQR) = (~same, ~same, log p)
• Oops – W from power method, precision lost!
Minimizing Communication of GMRES
• Cost(GMRES, ver. 3) = Cost(W) + Cost(TSQR)
    = (9kn^2/p, 4kn/p^(1/2), 8) + (2k^2n^2/p, k^2 log p / 2, log p)
• Latency cost independent of k, just log p – optimal
• Oops – W from power method, so precision lost – what to do?
• Use a different polynomial basis
  – Not monomial basis W = [v, Av, A^2v, …]; instead …
  – Newton basis: W_N = [v, (A – θ_1 I)v, (A – θ_2 I)(A – θ_1 I)v, …] or
  – Chebyshev basis: W_C = [v, T_1(v), T_2(v), …]
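The basis problem can be illustrated numerically. This is a hedged toy comparison, not the talk's code: the monomial basis behaves like the power method (columns collapse toward the dominant eigenvector), while the Newton basis with shifts θ_j stays much better conditioned. The evenly spaced shifts below are assumed sample values; in practice they would come from Ritz/Leja points.

```python
# Monomial vs Newton basis conditioning for W = [v, p_1(A)v, ..., p_k(A)v].
import numpy as np

def monomial_basis(A, v, k):
    W = [v]
    for _ in range(k):
        W.append(A @ W[-1])                       # power method iterates
    return np.column_stack(W)

def newton_basis(A, v, k, shifts):
    W = [v]
    for j in range(k):
        W.append(A @ W[-1] - shifts[j] * W[-1])   # (A - theta_j I) w
    return np.column_stack(W)

A = np.diag(np.linspace(1.0, 2.0, 100))           # toy matrix, known spectrum
v = np.ones(100)
k = 8
Wm = monomial_basis(A, v, k)
Wn = newton_basis(A, v, k, np.linspace(1.0, 2.0, k))
assert np.linalg.cond(Wn) < np.linalg.cond(Wm)    # Newton far better here
```

TSQR then orthogonalizes the better-conditioned W without reintroducing the latency cost.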
Related Work and Contributions for Sparse Linear Algebra
• Related work
  – s-step GMRES: De Sturler (1991), Bai, Hu & Reichel (1991), Joubert et al (1992), Erhel (1995)
  – s-step CG: Van Rosendale (1983), Chronopoulos & Gear (1989), Toledo (1995)
  – Matrix Powers Kernel: Pfeifer (1963), Hong & Kung (1981), Leiserson, Rao & Toledo (1993), Toledo (1995), Douglas, Hu, Kowarschik, Rüde, Weiss (2000), Strout, Carter & Ferrante (2001)
• Our contributions
  – Unified approach to serial and parallel, use of TSQR
  – Optimizing serial case via TSP
  – Unified approach to stability
  – Incorporating preconditioning
Summary and Conclusions (1/2)
• Possible to minimize communication complexity of much dense and sparse linear algebra
  – Practical speedups
  – Approaching theoretical lower bounds
• Optimal asymptotic complexity algorithms for dense linear algebra – also lower communication
• Hardware trends mean the time has come to do this
• Lots of prior work (see pubs) – and some new
Summary and Conclusions (2/2)
• Many open problems
  – Automatic tuning – build and optimize complicated data structures, communication patterns, code automatically: bebop.cs.berkeley.edu
  – Extend optimality proofs to general architectures
  – Dense eigenvalue problems – SBR or spectral D&C?
  – Sparse direct solvers – CALU or SuperLU?
  – Which preconditioners work?
  – Why stop at linear algebra?