Recent Progress in Autotuning at Berkeley
Jim Demmel, UC Berkeley
bebop.cs.berkeley.edu / parlab.eecs.berkeley.edu


Page 1: Recent Progress in Autotuning at Berkeley

Recent Progress in Autotuning at Berkeley

Jim Demmel, UC Berkeley

bebop.cs.berkeley.edu / parlab.eecs.berkeley.edu

Page 2: Recent Progress in Autotuning at Berkeley

Outline

• Introduction to ParLab
• Summary of autotuning activities
  – Also Shoaib Kamil's talk on Thursday @ 10:15
• Dense Linear Algebra
• Sparse Linear Algebra (summary)
• Communication collectives
• Responses to questions from the organizers

Page 3: Recent Progress in Autotuning at Berkeley

Overview of the Parallel Laboratory

Krste Asanovic, Ras Bodik, Jim Demmel, Tom Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, Kathy Yelick

parlab.eecs.berkeley.edu

Page 4: Recent Progress in Autotuning at Berkeley

How do compelling apps relate to 13 motifs?

[Figure: "Motif" popularity across the compelling applications (Red Hot → ...).]


Page 7: Recent Progress in Autotuning at Berkeley

ParLab Research Overview: easy to write correct programs that run efficiently on manycore.

[Figure: the ParLab stack. Applications (Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser) and the Motifs on top; a productivity layer with the Composition & Coordination Language (C&CL), the C&CL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, and Sketching; an efficiency layer with Efficiency Languages, Efficiency Language Compilers, Type Systems, Legacy Code, Schedulers, Communication & Synch. Primitives, and Autotuners; Correctness tools (Static Verification, Dynamic Checking, Debugging with Replay, Directed Testing) and Diagnosing Power/Performance along the sides; below, Legacy OS, OS Libraries & Services, Hypervisor, Multicore/GPGPU, and RAMP Manycore.]


Page 9: Recent Progress in Autotuning at Berkeley

Summary of Autotuning (1/2)

• Dense linear algebra
  – New algorithms that attain lower bounds on communication (parallel and sequential)
  – New algorithms for GPUs
• Sparse linear algebra (summary)
  – New algorithms that attain lower bounds on communication
• Collective communications
  – Choosing the right tree for reductions, broadcasts, etc.

Page 10: Recent Progress in Autotuning at Berkeley

Summary of Autotuning (2/2)

• To be presented by Shoaib Kamil on Thursday
• Recent work autotuning three parallel kernels
  – Sparse Matrix-Vector Multiply (SpMV)
  – Lattice Boltzmann MHD
  – Stencils (Heat Equation)
• Lessons learned / commonalities between the autotuners
• Towards a framework for building autotuners
  – What is the role of the compiler?

Page 11: Recent Progress in Autotuning at Berkeley

Motivation

• Running time of an algorithm is the sum of 3 terms (sketched in code below):
  • #flops × time_per_flop
  • #words moved / bandwidth      (communication)
  • #messages × latency           (communication)
• Exponentially growing gaps between
  • time_per_flop << 1/network BW << network latency
    • improving 59%/year vs 26%/year vs 15%/year
  • time_per_flop << 1/memory BW << memory latency
    • improving 59%/year vs 23%/year vs 5.5%/year
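A minimal sketch of this three-term model in Python; the default machine parameters are illustrative placeholders, not figures from the talk:

```python
def run_time(flops, words, messages,
             time_per_flop=1e-11,  # s per flop       (illustrative placeholder)
             bandwidth=1e9,        # words per second (illustrative placeholder)
             latency=1e-6):        # s per message    (illustrative placeholder)
    """Three-term model: computation + bandwidth + latency costs."""
    return (flops * time_per_flop
            + words / bandwidth
            + messages * latency)

# Example: a 2n^3-flop matmul that moves n^2 words in n messages, n = 1000.
n = 1000
print(run_time(flops=2 * n**3, words=n**2, messages=n))
```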

Page 12: Recent Progress in Autotuning at Berkeley

Motivation (continued)

• (Same three cost terms and growing gaps as the previous slide.)
• Goal: reorganize/tune motifs to avoid communication
  • Not just hiding communication (speedup ≤ 2x)
  • Arbitrary speedups possible for linear algebra
  • Success metric: attain lower bounds on communication

Page 13: Recent Progress in Autotuning at Berkeley

Linear Algebra Collaborators (so far)

• UC Berkeley
  – Kathy Yelick, Ming Gu
  – Mark Hoemmen, Marghoob Mohiyuddin, Kaushik Datta, George Petropoulos, Sam Williams, BeBOP group
  – Lenny Oliker, John Shalf
  – ParLab/UPCRC, a new parallel computing research center funded by Intel and Microsoft
• CU Denver
  – Julien Langou
• INRIA
  – Laura Grigori, Hua Xiang
• Georgia Tech
  – Rich Vuduc
• Dense work ⊆ ongoing development of Sca/LAPACK
  – Joint with UTK

Page 14: Recent Progress in Autotuning at Berkeley

Communication Lower Bounds for Dense Linear Algebra (1/2)

• Hong & Kung (1981), Irony/Tiskin/Toledo (2004)
• Usual matmul using 2n^3 flops
• Sequential case, with fast memory of size W < 3n^2
  – Thm: #words moved = Ω(n^3 / W^{1/2})
• Parallel case, P procs, O(n^2/P) memory per proc
  – Thm: #words moved = Ω(n^2 / P^{1/2})
• Bandwidth lower bound ⇒ latency lower bound (one-line derivation below)
  – #messages ≥ #words moved / (fast, local) memory size
  – Sequential: #messages = Ω(n^3 / W^{3/2})
  – Parallel: #messages = Ω(P^{1/2})
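The latency bounds follow from the bandwidth bounds in one step: a single message can carry at most one (fast, local) memory's worth of words, so for the sequential case

\[
\#\text{messages} \;\ge\; \frac{\#\text{words moved}}{W}
\;=\; \Omega\!\left(\frac{n^3/W^{1/2}}{W}\right)
\;=\; \Omega\!\left(\frac{n^3}{W^{3/2}}\right).
\]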

Page 15: Recent Progress in Autotuning at Berkeley

Communication Lower Bounds for Dense Linear Algebra (2/2)

• The same lower bounds apply to LU and QR
  – Assumption: O(n^3) algorithms; LU is easy but QR is subtle
• LAPACK and ScaLAPACK do not attain these bounds
  – ScaLAPACK attains the bandwidth bound
  – But sends O((mn/P)^{1/2}) times more messages
  – LAPACK attains neither; O((m^2/W)^{1/2}) times more words moved
• But new algorithms do attain them, mod polylog factors
  – Parallel QR: requires a new representation of Q, redundant flops
  – Sequential QR: can use the new one, or recursive
  – Parallel LU: new pivoting strategy needed, stable in practice
  – Sequential LU: can use the new one or recursive, but higher latency

Page 16: Recent Progress in Autotuning at Berkeley

Minimizing Communication in Parallel QR

[Figure: TSQR reduction tree. W is split into block rows W0..W3; local QRs give R00, R10, R20, R30; pairs are combined by QR into R01 and R11, which combine into R02.]

• QR decomposition of an m x n matrix W, m >> n
• TSQR = "Tall Skinny QR"
• P processors, block row layout
• Usual parallel algorithm
  • Compute a Householder vector for each column
  • Number of messages ∝ n log P
• Communication-avoiding algorithm (sketched below)
  • Reduction operation, with QR as the operator
  • Number of messages ∝ log P (optimal)
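A minimal serial sketch of the TSQR reduction; numpy stands in for the P processors, and the binary tree with P a power of two is an assumption of the sketch (Q is left implicit):

```python
import numpy as np

def tsqr_r(W, P=4):
    """R factor of a tall-skinny W via a binary reduction tree.

    Each 'processor' factors its block row locally; pairs of R factors
    are stacked and re-factored until one R remains, so there are
    log2(P) combining rounds. Q is implicitly the tree of local Q's.
    """
    Rs = [np.linalg.qr(B, mode='r')                    # local QR at each leaf
          for B in np.array_split(W, P, axis=0)]
    while len(Rs) > 1:                                 # log2(P) reduction rounds
        Rs = [np.linalg.qr(np.vstack(pair), mode='r')
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]

# Sanity check: R agrees with a direct QR up to the signs of its rows.
W = np.random.randn(1000, 8)
assert np.allclose(np.abs(tsqr_r(W)), np.abs(np.linalg.qr(W, mode='r')))
```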

Page 17: Recent Progress in Autotuning at Berkeley

TSQR in more detail

Q is represented implicitly as a product (tree of factors).

Page 18: Recent Progress in Autotuning at Berkeley

Minimizing Communication in TSQR

[Figure: reduction trees over block rows W0..W3.
Parallel: binary tree; local R00..R30 combined pairwise into R01, R11, then R02.
Sequential: flat tree; R00 is folded with W1 into R01, with W2 into R02, with W3 into R03.
Dual core: a hybrid of the two.]

Choose the reduction tree dynamically.
Multicore / multisocket / multirack / multisite / out-of-core: ?

Page 19: Recent Progress in Autotuning at Berkeley

Performance of TSQR vs Sca/LAPACK

• Parallel
  – Pentium III cluster, Dolphin Interconnect, MPICH
    • Up to 6.7x speedup (16 procs, 100K x 200)
  – BlueGene/L
    • Up to 4x speedup (32 procs, 1M x 50)
  – Both use Elmroth-Gustavson locally, enabled by TSQR
• Sequential
  – Out-of-core on a PowerPC laptop
    • As little as 2x slowdown vs (predicted) infinite DRAM
• See UC Berkeley EECS Tech Report 2008-74
  • Being revised to add optimality results!
• General rectangular QR
  • TSQR for the panel factorization, then right-looking
  • Implementation underway (Julien Langou)

Page 20: Recent Progress in Autotuning at Berkeley

Modeled Speedups of CAQR vs ScaLAPACK

• Petascale: up to 22.9x
• IBM Power5: up to 9.7x
• "Grid": up to 11x

(Petascale machine: 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s.)

Page 21: Recent Progress in Autotuning at Berkeley

TSLU: LU factorization of a tall skinny matrix

First, try the obvious generalization of TSQR: use LU as the operator in the same reduction tree.

Page 22: Recent Progress in Autotuning at Berkeley

Growth factor for TSLU-based factorization

Unstable for large P and large matrices. When P = #rows, TSLU is equivalent to parallel pivoting.

(Courtesy of H. Xiang)

Page 23: Recent Progress in Autotuning at Berkeley

Making TSLU Stable

• At each node in the tree, TSLU selects b pivot rows from the 2b candidates supplied by its 2 child nodes (sketched below)
• At each node, do LU on the 2b original rows selected by the child nodes, not on the U factors from the child nodes
• When TSLU is done, permute the b selected rows to the top of the original matrix, then redo b steps of LU without pivoting
• CALU: Communication-Avoiding LU for general A
  – Use TSLU for panel factorizations
  – Apply updates to the rest of the matrix
  – Cost: redundant panel factorizations
• Benefit:
  – Stable in practice, but not the same pivot choice as GEPP
  – b times fewer messages overall: faster
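A minimal serial sketch of that tournament (numpy only; the binary tree and P a power of two are assumptions of the sketch; in the real TSLU each leaf lives on a different processor and only candidate rows travel up the tree):

```python
import numpy as np

def select_pivot_rows(rows, A, b):
    """Run b steps of GEPP on the candidate rows; return the b rows it picks."""
    B = A[rows].astype(float)
    idx = np.asarray(rows).copy()
    for k in range(b):
        p = k + np.argmax(np.abs(B[k:, k]))            # partial pivoting
        B[[k, p]] = B[[p, k]]
        idx[[k, p]] = idx[[p, k]]
        B[k+1:, k] /= B[k, k]
        B[k+1:, k+1:] -= np.outer(B[k+1:, k], B[k, k+1:])
    return idx[:b]

def tslu_pivot_rows(A, P=4):
    """Tournament pivoting for a tall-skinny m x b panel A."""
    m, b = A.shape
    cand = [select_pivot_rows(g, A, b)                 # local GEPP at the leaves
            for g in np.array_split(np.arange(m), P)]
    while len(cand) > 1:                               # log2(P) reduction rounds
        cand = [select_pivot_rows(np.concatenate(pair), A, b)
                for pair in zip(cand[0::2], cand[1::2])]
    return cand[0]  # permute these b rows to the top, then LU without pivoting

print(tslu_pivot_rows(np.random.randn(64, 4)))
```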

Page 24: Recent Progress in Autotuning at Berkeley

Growth factor for the better CALU approach

Like threshold pivoting with worst-case threshold = 0.33, so |L| ≤ 3. Testing shows about the same residual as GEPP.

Page 25: Recent Progress in Autotuning at Berkeley

Performance vs ScaLAPACK

• TSLU
  – IBM Power5
    • Up to 4.37x faster (16 procs, 1M x 150)
  – Cray XT4
    • Up to 5.52x faster (8 procs, 1M x 150)
• CALU
  – IBM Power5
    • Up to 2.29x faster (64 procs, 1000 x 1000)
  – Cray XT4
    • Up to 1.81x faster (64 procs, 1000 x 1000)
• Optimality analysis analogous to QR
• See INRIA Tech Report 6523 (2008)

Page 26: Recent Progress in Autotuning at Berkeley

Speedup prediction for a Petascale machine: up to 81x faster (P = 8192)

(Petascale machine: 8192 procs, each at 500 GFlops/s, with a bandwidth of 4 GB/s.)

Page 27: Recent Progress in Autotuning at Berkeley

Related Work and Contributions for Dense Linear Algebra

• Related work
  – Pothen & Raghavan (1989)
  – Toledo (1997), Rabani & Toledo (2001)
  – Dunha, Becker & Patterson (2002)
  – Gunter & van de Geijn (2005)
• Our contributions
  – QR: 2D parallel, efficient 1D parallel QR
  – Unified approach to sequential, parallel, ...
  – Parallel LU
  – Communication lower bounds, proving optimality

Page 28: Recent Progress in Autotuning at Berkeley

Heterogeneous (CPU + GPU) Performance Libraries

Vasily Volkov

Page 29: Recent Progress in Autotuning at Berkeley

Challenges in programming GPUs

• Relatively little tuning information for GPU hardware
  – Hardware exposed via an abstract programming model
  – Hard to develop accurate performance models
  – Hardware changing rapidly
• We get ~2x speedups vs. the best other work on a variety of libraries
  – Matrix-matrix multiply (1.6x speedup)
  – FFT (2x speedup)
  – LU/Cholesky factorization (3x/2x speedups)
  – Tridiagonal eigenvalue solver (2x speedup)
• Example: partial pivoting in LU factorization on an NVIDIA GPU
  – Ad: "GPU supports gather/scatter with arbitrary access patterns"
  – LAPACK code (sgetrf) spends 50% of its time doing pivoting
  – Gets worse with larger matrices
  – Only 1% of the time is spent in pivoting if the matrix is transposed
  – 2x speedup!
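The pivoting example is a memory-layout effect: in column-major storage (LAPACK's convention) a row swap touches n strided locations, while in the transposed (row-major) layout it is a contiguous copy. A CPU-side numpy sketch of the same effect; sizes and trial counts are arbitrary:

```python
import time
import numpy as np

n = 2048
A_col = np.asfortranarray(np.random.randn(n, n))  # column-major, as in LAPACK
A_row = np.ascontiguousarray(A_col)               # row-major ("transposed" layout)

def time_row_swaps(A, trials=2000):
    rng = np.random.default_rng(0)
    t0 = time.perf_counter()
    for _ in range(trials):
        i, j = rng.integers(n, size=2)
        A[[i, j]] = A[[j, i]]                     # one pivot row interchange
    return time.perf_counter() - t0

print("column-major swaps:", time_row_swaps(A_col))  # n strided accesses per swap
print("row-major swaps:   ", time_row_swaps(A_row))  # contiguous copies per swap
```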

Page 30: Recent Progress in Autotuning at Berkeley

Matrix-matrix multiply (SGEMM)

• Uses software prefetching, strip-mining, register blocking
• Resembles algorithms used for the Cray X1 and IBM 3090 VF
• 60% of peak, bound by access to on-chip shared memory

Page 31: Recent Progress in Autotuning at Berkeley

FFT, complex-to-complex, batched

• Results on an NVIDIA GeForce 8800GTX
• Resembles algorithms used for vector processors
• Radix-8 in registers, transposes using shared memory

[Figure: Gflop/s vs. transform size N (1 to 32768) for benchFFT, CUFFT 1.1, and our best (register storage); y-axis 0 to 250 Gflop/s.]

Page 32: Recent Progress in Autotuning at Berkeley

Heterogeneous Computing

• Question: should you port entire applications to the GPU? Maybe not.
• Example: eigenvalue solver (bisection, LAPACK's sstebz)
  – Most work (O(n^2)) is in vectorized, embarrassingly parallel code
    – Do it on the GPU and finely tune it
    – May do many redundant flops on the GPU to go faster (multisection)
    – Run on the CPU when the problem is too small for the GPU
    – Use a performance model to choose the CPU or GPU version (toy sketch below)
  – Little work (O(n)) is in nearly serial code
    – Do it on the CPU: time-consuming and non-trivial to port
• Independent project:
  – Did everything on the GPU
  – Used advanced parallel algorithms for the O(n) work
  – Up to 2x slower running on a faster GPU (compared to us)
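A toy version of that performance-model dispatch, in the spirit of the slide; every number below is an illustrative placeholder, not a measured value:

```python
def choose_device(n,
                  gpu_gflops=200.0,         # illustrative placeholder
                  cpu_gflops=15.0,          # illustrative placeholder
                  launch_overhead_s=5e-6,   # illustrative placeholder
                  transfer_bytes_per_s=3e9):
    """Pick CPU or GPU for the O(n^2) bisection work from a simple model."""
    flops = 2.0 * n * n                         # O(n^2) work; constant is arbitrary
    gpu_t = (launch_overhead_s
             + 4.0 * n / transfer_bytes_per_s   # ship n floats to the device
             + flops / (gpu_gflops * 1e9))
    cpu_t = flops / (cpu_gflops * 1e9)
    return "gpu" if gpu_t < cpu_t else "cpu"

for n in (100, 1000, 10000):
    print(n, choose_device(n))   # small problems stay on the CPU
```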

Page 33: Recent Progress in Autotuning at Berkeley

Breakdown of Runtime (sstebz)

"Count(x)" is the O(n^2) work; use either SSE or the GPU. "The rest" is the O(n) work; it isn't worth optimizing.

Page 34: Recent Progress in Autotuning at Berkeley

LU Factorization

• O(n^3) work in bulky BLAS3 operations
• O(n^2) work in fine-grain BLAS1/BLAS2 (panel factorization)
• Panel factorization is usually faster on the CPU
  – Peak sustained bandwidth of the 8800GTX is 75 GB/s
  – Kernel launch overhead is 5 µs
  – BLAS1 and BLAS2 are bandwidth bound, so they run at only a small fraction of peak (see the arithmetic below)
  – Compare vs. a CPU handicapped by CPU-GPU transfers
  – Faster on the CPU up to n ~ 16K on an NVIDIA 8800GTX
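As a back-of-the-envelope illustration (our arithmetic, not the slide's): a single-precision BLAS1 kernel like saxpy moves about 12 bytes for every 2 flops, so at the 8800GTX's 75 GB/s it is capped near

\[
\frac{75\ \text{GB/s}}{6\ \text{bytes/flop}} \;\approx\; 12.5\ \text{Gflop/s},
\]

a small fraction of the GPU's compute peak, which is why thin panel factorizations are dominated by bandwidth, launch overhead, and transfers.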

Page 35: Recent Progress in Autotuning at Berkeley

Overall Performance

Page 36: Recent Progress in Autotuning at Berkeley

Overlap in LU factorization

• Overlap panel factorization on the CPU with SGEMM on the GPU

Page 37: Recent Progress in Autotuning at Berkeley

LU slowdowns from omitting optimizations

Page 38: Recent Progress in Autotuning at Berkeley

Summary of GPU Experience

• Heterogeneity expands the tuning space
  – Dividing work between CPU and GPU
    • Depends a lot on flop rate / bandwidth / latency / startup
  – Different data layouts may be best for CPU and GPU
    • Rowwise vs columnwise for LU
  – May want to do (many) extra flops on the GPU
    • Multisection vs bisection in the eigensolver
    • TRINV + TRMM vs TRSM in LU
• Rapid evolution of GPU architectures motivates autotuning

Page 39: Recent Progress in Autotuning at Berkeley

Why all our problems are solved for dense linear algebra, in theory

• Thm (D., Dumitriu, Holtz, Kleinberg) (Numer. Math. 2007)
  – Given any matmul running in O(n^ω) ops for some ω > 2, it can be made stable and still run in O(n^{ω+ε}) ops, for any ε > 0.
• Current record: ω ≈ 2.38
• Thm (D., Dumitriu, Holtz) (Numer. Math. 2008)
  – Given any stable matmul running in O(n^{ω+ε}) ops, can do backward stable dense linear algebra in O(n^{ω+ε}) ops:
    • GEPP, QR (recursive)
    • rank-revealing QR (randomized)
    • (Generalized) Schur decomposition, SVD (randomized)
• Also reduces communication to O(n^{ω+ε})
• But constants?

Page 40: Recent Progress in Autotuning at Berkeley

Avoiding Communication in Sparse Linear Algebra: Summary

• Take k steps of a Krylov subspace method
  – GMRES, CG, Lanczos, Arnoldi
  – Assume the matrix is "well-partitioned," with modest surface-to-volume ratio
  – Parallel implementation
    • Conventional: O(k log p) messages
    • "New": O(log p) messages (optimal)
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • "New": O(1) moves of data (optimal)
• Can incorporate some preconditioners
  – Hierarchical, semiseparable matrices ...
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation

Page 41: Recent Progress in Autotuning at Berkeley

Tuning Collectives

Rajesh Nishtala

• Collectives are called by all threads to perform global communication
  – Broadcast, reduction, all-to-all
• Need to tune because the best algorithm depends on
  – #procs
  – network latency
  – network bandwidth
  – transfer size, etc.
• Focus on PGAS languages and one-sided communication models
  – E.g. UPC, Titanium, Co-Array Fortran

Page 42: Recent Progress in Autotuning at Berkeley

Example Tree Topologies

[Figure: radix-4 k-nomial tree (quadnomial); radix-2 k-nomial tree (binomial); binary tree; fork tree.]
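A minimal sketch of the send schedule behind the k-nomial trees (radix 2 gives the binomial tree, radix 4 the quadnomial; numbering ranks relative to the root is an assumption of the sketch):

```python
def knomial_schedule(P, radix=2):
    """Rounds of a k-nomial broadcast: every rank that already has the data
    sends to ranks offset by 1..radix-1 times the current step size."""
    rounds, have, step = [], {0}, 1
    while len(have) < P:
        sends = [(src, src + j * step)
                 for src in sorted(have)
                 for j in range(1, radix)
                 if src + j * step < P and src + j * step not in have]
        have.update(dst for _, dst in sends)
        rounds.append(sends)
        step *= radix
    return rounds  # number of rounds grows like log_radix(P)

for r, sends in enumerate(knomial_schedule(16, radix=4)):
    print("round", r, sends)   # 16 ranks covered in 2 quadnomial rounds
```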

Page 43: Recent Progress in Autotuning at Berkeley

Distributed Memory Broadcast

• 1000 16-byte broadcasts
• The best implementation is platform specific
• Low cost to inject a message ⇒ want a flat tree (6-ary)
• High cost ⇒ want a deep tree to parallelize the overhead (binomial)

[Figure: broadcast time across platforms; lower is better ("GOOD").]

Page 44: Recent Progress in Autotuning at Berkeley

Distributed Memory All-to-All

• The optimal algorithm varies based on transfer size
  – The O(P log P) algorithm doubles the message size after every round
  – Expensive as the processor count grows
• The cross-over point is platform specific

[Figure: all-to-all time vs. transfer size; lower is better ("GOOD").]

Page 45: Recent Progress in Autotuning at Berkeley

Collective Synchronization

• UPC presents new semantics for collective synchronization
• Loose synchronization lets the collective begin after any thread arrives (looser than MPI semantics)
• Strict synchronization requires all threads to arrive before communication can start (stricter than MPI)
• The synchronization semantics affect the algorithm choice

Page 46: Recent Progress in Autotuning at Berkeley

Multicore Barrier Performance Results

• Time many back-to-back barriers on a dual-socket Sun Niagara 2 (128 hardware threads)
• Flat tree is just one level, with all threads reporting to thread 0
  – Leverages shared memory but is non-scalable
• Architecture-independent tree (radix = 2)
  – Pick a generic "good" radix that is suitable for many platforms
  – Mismatched to the architecture
• Architecture-dependent tree
  – Search over all radices to pick the tree that best matches the architecture
  – Search revealed radix 8 to be the best

[Figure: Niagara 2 (Maramba) barrier performance. Execution time (us, 0 to 18) vs. thread count (2 to 128) for flat tree, architecture-independent (radix = 2), and architecture-dependent (radix = best); lower is better ("GOOD").]

Page 47: Recent Progress in Autotuning at Berkeley

Summary and Conclusions (1/2)

• Possible to minimize the communication complexity of much dense and sparse linear algebra
  – Practical speedups
  – Approaching theoretical lower bounds
• Hardware trends mean the time has come to do this
• Optimal asymptotic complexity algorithms for dense linear algebra also lower communication
• Important to optimize the communication collectives themselves

Page 48: Recent Progress in Autotuning at Berkeley

Summary and Conclusions (2/2)

• Many open problems
  – How to add these to the automatic tuning design space
    • PLASMA (Jakub Kurzak)
  – Extend optimality proofs and algorithms to more general architectures
    • Heterogeneity, including multiple bandwidths and latencies
  – Dense eigenvalue problems: SBR or spectral D&C?
  – Sparse direct solvers: CALU or SuperLU?
  – Which preconditioners work?
  – Other motifs

Page 49: Recent Progress in Autotuning at Berkeley

Answers to Questions from Organizers (1/5)

• What about tuning at petascale?
  – Communication-avoiding algorithms help address it
  – Can leverage multicore tuning, if we can mirror the hierarchy of machines in a hierarchy of algorithms/tuners
• How do we measure success for tuning?
  – Performance gains attained
  – Ease of use by the end user
  – Ease of development/evolution/maintenance of the tuners
• What architectures should we target?
  – Multicore and petascale, as above
  – Emerging architectures like GPUs

Page 50: Recent Progress in Autotuning at Berkeley

Answers to Questions from Organizers (2/5)

• T/F: "Parameter tuning" is not enough
  – Many of us have been changing data structures and algorithms for a while, which goes beyond "parameters"
• T/F: Self-tuned libraries will beat compilers
  – What is the starting point for the compiler? The most naive possible code? Then yes (but it may work for some motifs)
  – But autotuners use compilers to compile the tuned source code
• Will compilers change data structures/algorithms?
  – Would they still be called compilers, or something else?
  – How much wisdom about algorithms, numerical analysis, and applied math can we expect compilers to incorporate?

Page 51: Recent Progress in Autotuning at Berkeley

Answers to Questions from Organizers (3/5)

• Simple performance models will make search unnecessary, e.g. "cache-oblivious"
  – If search is easy compared to figuring out a "simple" performance model, it is more productive
  – "Cache oblivious" would not have found the new communication-avoiding algorithms
  – Simple models help decide when to stop tuning
• Boundaries between SW layers are too rigid
  – Yes! See the ParLab structure for one take on breaking the usual layering of the HW/SW stack, e.g. schedulers

Page 52: Recent Progress in Autotuning at Berkeley

Answers to Questions from Organizers (4/5)

• What issues/technologies are we ignoring?
  – Need more compiler-based expertise to build autotuners, so they are easier to design/develop/maintain, and not just big Perl scripts that must be written from scratch for each new architecture or slightly different function
• What common tools should we build?
  – See the answer to the last question; see Pluto, talk by Chen
• Tuning systems are too narrow; invest in compilers instead
  – If compilers do not change data structures or algorithms, they will miss a lot of performance gains
  – The ParLab hypothesis is that 13 motifs/tuners is enough
  – Invest in compiler tools to help build autotuners

Page 53: Recent Progress in Autotuning at Berkeley

Answers to Questions from Organizers (5/5)

• Runtime optimization is needed
  – Yes! A number of autotuners do this now, e.g. JIT, OSKI
• If all layers of the SW stack are autotuned, how will they be composed?
  – A big question for ParLab too! One answer for multicore is to have schedulers for each library that can accept or give up resources (cores) on the fly, so that higher-level schedulers can balance work at a higher level

Page 54: Recent Progress in Autotuning at Berkeley

EXTRA SLIDES

Page 55: Recent Progress in Autotuning at Berkeley

Avoiding Communication in Sparse Linear Algebra: Summary (repeats the slide on Page 40)

Page 56: Recent Progress in Autotuning at Berkeley

Locally Dependent Entries for [x, Ax], A tridiagonal, 2 processors

[Figure: the entries of x, Ax, ..., A^8 x owned by Proc 1 and Proc 2 (entry index 1..30); the highlighted entries of Ax can be computed without communication.]

Page 57: Recent Progress in Autotuning at Berkeley

Locally Dependent Entries for [x, Ax, A^2 x], A tridiagonal, 2 processors

[Figure: as above, one more level computed without communication.]

Page 58: Recent Progress in Autotuning at Berkeley

Locally Dependent Entries for [x, Ax, ..., A^3 x], A tridiagonal, 2 processors

[Figure: as above, one more level computed without communication.]

Page 59: Recent Progress in Autotuning at Berkeley

Locally Dependent Entries for [x, Ax, ..., A^4 x], A tridiagonal, 2 processors

[Figure: as above, one more level computed without communication.]

Page 60: Recent Progress in Autotuning at Berkeley

Locally Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors

[Figure: all k = 8 levels; the highlighted triangle of entries can be computed without communication, giving k = 8-fold reuse of A.]

Page 61: Recent Progress in Autotuning at Berkeley

Remotely Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors

[Figure: type (1) remote dependencies for k = 8; the entries near the processor boundary depend on the other processor's data.]

One message to get the data needed to compute the remotely dependent entries, not k = 8 messages (sketched below). This minimizes the number of messages, i.e. the latency cost. Price: redundant work ∝ the "surface/volume ratio".
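A minimal sketch of that idea for the tridiagonal case, for one processor's share; the single exchanged message is modeled by the two ghost arrays of length k, and the stencil weights are placeholders:

```python
import numpy as np

def local_matrix_powers(x_local, ghost_left, ghost_right, k, w=(1.0, 2.0, 1.0)):
    """One processor's rows of [x, Ax, ..., A^k x] for tridiagonal A.

    ghost_left/ghost_right hold the k neighbor entries on each side,
    received in ONE message instead of one message per step; the edge
    entries are recomputed redundantly (the surface/volume price).
    """
    a, b, c = w
    v = np.concatenate([ghost_left, x_local, ghost_right])
    out = [x_local.copy()]
    for step in range(1, k + 1):
        v = a * v[:-2] + b * v[1:-1] + c * v[2:]   # one tridiagonal mat-vec
        lo = k - step                              # valid region shrinks by 1/side
        out.append(v[lo:len(v) - lo].copy())
    return out

x = np.arange(10.0)
vecs = local_matrix_powers(x, np.zeros(8), np.zeros(8), k=8)
print(len(vecs), vecs[1][:3])   # 9 vectors: x through A^8 x, local entries only
```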

Page 62: Recent Progress in Autotuning at Berkeley

Fewer Remotely Dependent Entries for [x, Ax, ..., A^8 x], A tridiagonal, 2 processors

[Figure: type (2) remote dependencies for k = 8; a smaller set of boundary entries is fetched.]

Reduces the redundant work by half.

Page 63: Recent Progress in Autotuning at Berkeley

Remotely Dependent Entries for [x, Ax, A^2 x, A^3 x], A irregular, multiple processors

Page 64: Recent Progress in Autotuning at Berkeley

Sequential [x, Ax, ..., A^4 x], with memory hierarchy

One read of the matrix from slow memory, not k = 4 reads. This minimizes the words moved, i.e. the bandwidth cost. No redundant work.

Page 65: Recent Progress in Autotuning at Berkeley

Performance Results

• Measured
  – Sequential/OOC speedup up to 3x
• Modeled
  – Sequential/multicore speedup up to 2.5x
  – Parallel/Petascale speedup up to 6.9x
  – Parallel/Grid speedup up to 22x
• See bebop.cs.berkeley.edu/#pubs

Page 66: Recent Progress in Autotuning at Berkeley

Optimizing Communication Complexity of Sparse Solvers

• Example: GMRES for Ax = b on a "2D mesh"
  – x lives on an n-by-n mesh
  – Partitioned on a p^{1/2}-by-p^{1/2} processor grid
  – A has a "5-point stencil" (Laplacian), sketched below
    • (Ax)(i,j) = linear_combination(x(i,j), x(i,j±1), x(i±1,j))
  – Ex: an 18-by-18 mesh on a 3-by-3 grid
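For concreteness, one 5-point-stencil mat-vec over the whole mesh; the Laplacian weights (4, -1, -1, -1, -1) and zero boundary are the usual choice, assumed here:

```python
import numpy as np

def apply_5pt(x):
    """y = A x for the 2D 5-point Laplacian on an n-by-n mesh (zero boundary)."""
    y = 4.0 * x
    y[1:, :]  -= x[:-1, :]   # neighbor above
    y[:-1, :] -= x[1:, :]    # neighbor below
    y[:, 1:]  -= x[:, :-1]   # neighbor to the left
    y[:, :-1] -= x[:, 1:]    # neighbor to the right
    return y

x = np.random.randn(18, 18)  # the 18-by-18 mesh; on a 3-by-3 processor grid
print(apply_5pt(x).shape)    # each processor would own a 6-by-6 block
```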

Page 67: Recent Progress in Autotuning at Berkeley

Minimizing Communication of GMRES

• What is the cost = (#flops, #words, #messages) of k steps of standard GMRES?

GMRES, ver. 1:
  for i = 1 to k
    w = A * v(i-1)
    MGS(w, v(0), ..., v(i-1))
    update v(i), H
  end for
  solve the LSQ problem with H

[Figure: each processor owns an n/p^{1/2}-by-n/p^{1/2} block of the mesh.]

• Cost(A*v) = k * (9n^2/p, 4n/p^{1/2}, 4)
• Cost(MGS) = k^2/2 * (4n^2/p, log p, log p)
• Total cost ~ Cost(A*v) + Cost(MGS)
• Can we reduce the latency?

Page 68: Recent Progress in Autotuning at Berkeley

Minimizing Communication of GMRES

• Cost(GMRES, ver. 1) = Cost(A*v) + Cost(MGS)
  = (9kn^2/p, 4kn/p^{1/2}, 4k) + (2k^2n^2/p, k^2 log p / 2, k^2 log p / 2)
• How much latency cost from A*v can you avoid? Almost all of it.

GMRES, ver. 2:
  W = [v, Av, A^2 v, ..., A^k v]
  [Q, R] = MGS(W)
  Build H from R, solve the LSQ problem

[Figure: mesh blocks with ghost regions for k = 3.]

• Cost(W) = (~same, ~same, 8)
• Latency cost independent of k: optimal
• Cost(MGS) unchanged
• Can we reduce the latency more?

Page 69: Recent Progress in Autotuning at Berkeley

Minimizing Communication of GMRES

• Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
  = (9kn^2/p, 4kn/p^{1/2}, 8) + (2k^2n^2/p, k^2 log p / 2, k^2 log p / 2)
• How much latency cost from MGS can you avoid? Almost all of it.

GMRES, ver. 3:
  W = [v, Av, A^2 v, ..., A^k v]
  [Q, R] = TSQR(W)   ... "Tall Skinny QR"
  Build H from R, solve the LSQ problem

[Figure: TSQR tree on W1..W4: R1..R4, then R12, R34, then R1234.]

• Cost(TSQR) = (~same, ~same, log p)
• Latency cost independent of k: optimal


Page 71: Recent Progress in Autotuning at Berkeley

Minimizing Communication of GMRES (build of the previous slide)

• Cost(TSQR) = (~same, ~same, log p)
• Oops: W is from the power method, so precision is lost!

Page 73: Recent Progress in Autotuning at Berkeley

Minimizing Communication of GMRES

• Cost(GMRES, ver. 3) = Cost(W) + Cost(TSQR)
  = (9kn^2/p, 4kn/p^{1/2}, 8) + (2k^2n^2/p, k^2 log p / 2, log p)
• Latency cost independent of k, just log p: optimal
• Oops: W is from the power method, so precision is lost. What to do?
• Use a different polynomial basis (sketched below)
  • Not the monomial basis W = [v, Av, A^2 v, ...]; instead ...
  • Newton basis W_N = [v, (A - θ1 I)v, (A - θ2 I)(A - θ1 I)v, ...] or
  • Chebyshev basis W_C = [v, T1(v), T2(v), ...]
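A minimal sketch contrasting the two bases; the diagonal test matrix, the shift choice, and the normalization are illustrative only (in practice the θ's would be, e.g., Leja-ordered Ritz values):

```python
import numpy as np

def krylov_basis(A, v, k, shifts=None):
    """Monomial basis [v, Av, ..., A^k v], or the Newton basis
    [v, (A-θ1 I)v, (A-θ2 I)(A-θ1 I)v, ...] when shifts are given."""
    W = [v / np.linalg.norm(v)]
    for i in range(k):
        w = A @ W[-1]
        if shifts is not None:
            w -= shifts[i] * W[-1]
        W.append(w)
    return np.column_stack(W)

n, k = 200, 12
A = np.diag(np.linspace(1.0, 2.0, n))   # spectrum in [1, 2]
v = np.random.randn(n)
print(np.linalg.cond(krylov_basis(A, v, k)))                        # monomial: ill-conditioned
print(np.linalg.cond(krylov_basis(A, v, k, np.linspace(1, 2, k))))  # Newton: typically much better
```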

Page 75: Recent Progress in Autotuning at Berkeley

Related Work and Contributions for Sparse Linear Algebra

• Related work
  – s-step GMRES: De Sturler (1991), Bai, Hu & Reichel (1991), Joubert et al (1992), Erhel (1995)
  – s-step CG: Van Rosendale (1983), Chronopoulos & Gear (1989), Toledo (1995)
  – Matrix powers kernel: Pfeifer (1963), Hong & Kung (1981), Leiserson, Rao & Toledo (1993), Toledo (1995), Douglas, Hu, Kowarschik, Rüde, Weiss (2000), Strout, Carter & Ferrante (2001)
• Our contributions
  – Unified approach to serial and parallel, use of TSQR
  – Optimizing the serial case via TSP
  – Unified approach to stability
  – Incorporating preconditioning

Page 76: Recent Progress in Autotuning at Berkeley

Summary and Conclusions (1/2)

• Possible to minimize the communication complexity of much dense and sparse linear algebra
  – Practical speedups
  – Approaching theoretical lower bounds
• Optimal asymptotic complexity algorithms for dense linear algebra also lower communication
• Hardware trends mean the time has come to do this
• Lots of prior work (see pubs), and some new

Page 77: Recent Progress in Autotuning at Berkeley

Summary and Conclusions (2/2)

• Many open problems
  – Automatic tuning: build and optimize complicated data structures, communication patterns, and code automatically: bebop.cs.berkeley.edu
  – Extend optimality proofs to general architectures
  – Dense eigenvalue problems: SBR or spectral D&C?
  – Sparse direct solvers: CALU or SuperLU?
  – Which preconditioners work?
  – Why stop at linear algebra?