
DELFT UNIVERSITY OF TECHNOLOGY

REPORT 00-02

COMPARING GMRES AND P-GMRES IN DOMAIN

DECOMPOSITION WITH APPROXIMATE SUBDOMAIN

SOLUTION

K. DEKKER

ISSN 1389-6520

Reports of the Department of Applied Mathematical Analysis

Delft 2000


Copyright 2000 by Department of Applied Mathematical Analysis, Delft, The Netherlands.

No part of the Journal may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission from Department of Applied Mathematical Analysis, Delft University of Technology, The Netherlands.


Comparing GMRES and P-GMRES in Domain Decomposition with approximate subdomain solution

K. Dekker

Abstract

Solution of large linear systems encountered in computational fluid dynamics often leads to some form of domain decomposition, especially when it is desired to use parallel machines. In this paper P-GMRES, a partitioned modification of GMRES, is applied to such problems. It is shown that P-GMRES converges faster than GMRES if the subdomains are solved exactly, and that P-GMRES requires less communication in the computation of the inner products. Also, approximate solutions for the subdomains by an inner preconditioned GMRES iteration are considered, in combination with restarted versions of GMRES and P-GMRES. We investigate the effect of the tolerance in the subdomain problems on the convergence of the outer iteration, and on the total amount of work in numerical experiments. It turns out that rather crude tolerances are allowed, and that a good strategy is to vary the tolerance for the subdomains in the course of the outer iteration.

Keywords: Domain decomposition; Parallel GMRES methods; Approximate subdomain solution; Orthogonalisation methods

1 Introduction.

Domain decomposition arises naturally in computational fluid dynamics applications on structured grids: complicated geometries are broken down into (topologically) rectangular regions and discretised, see e.g. [23, 30], and by solving subproblems on these regions one arrives at the solution on the global domain. This approach provides easy exploitation of parallel computing resources, and additionally offers a solution to memory limitation problems.

Frank and Vuik [11] address the parallel implementation of a domain decomposition method for the DeFT Navier-Stokes solver described in [23]. Their paper is a continuation of work by Brakkee, summarised in [5] and presented in [6], where a serial implementation of a nonoverlapping, one-level additive Schwarz method with approximate subdomain solution gave promising results. In [11] the GCR method in combination with inaccurate subdomain solution is tested for a Poisson problem on a square domain, which is representative for the system to be solved for the pressure correction method used in DeFT. Our present goal is to evaluate the partitioned method, P-GMRES, described in [9], in combination with accurate and inaccurate subdomain solution on such a problem.

Faculty of Mathematics and Informatics, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands; E-mail: [email protected]

Theoretical results on approximate solution of subproblems for Schur complement domain decomposition methods are given by Borgers [4], Haase et al. [12, 13, 14, 20]. Tan [25] and Brakkee [5] give theoretical results for nonoverlapping Schwarz iterations with approximate subdomain solvers.

In this paper we demonstrate for a nonoverlapping, additive Schwarz method that P-GMRES converges faster than GMRES if the subdomains are solved exactly, and that restarted versions of both methods can be applied if the subproblems are solved with moderate accuracy. We show that the computational work of the methods is about the same per iteration, and that P-GMRES requires less communication, so it can be more efficiently parallelised. On the other hand, the applicability of P-GMRES is restricted to the class of problems for which a red-black colouring of subdomains exists such that no adjacent subdomains have the same colour.

In Section 2 we briefly review the relevant mathematics and present the GMRES and P-GMRES algorithms. Much effort has focussed on the efficient parallelisation of Krylov subspace methods. The computation and communication of the many inner products often limit the attainable speedup on many processors. Therefore, authors have tried to overlap inner product communication with computation [8], or to increase the number of inner products that can be computed with a single communication [2, 18, 8]. Frank and Vuik [11] suggest increasing the amount of computation to reduce the number of communications in GCR. We show that P-GMRES requires less communication than GCR or GMRES, and that the communications can be easily overlapped with computation.

In Section 2.4 we address the solution of the subdomain problems. We give evidence that the restarted versions of GMRES and P-GMRES are applicable in combination with an approximate subdomain solution. Moreover, we show that a variable precision in the solution of these subproblems is likely to be most efficient, which is confirmed by experiments in Section 4.

A performance model for the orthogonalisations in (P-)GMRES, derived from [11], is presented in Section 3. Theoretical speedup ratios for P-GMRES on a workstation cluster and a Cray T3E, based on this model, are given. We also develop a model for the costs of various subdomain solvers which is used for the evaluation of the results in Section 4.

In Section 4 we compare the convergence rate of GMRES and P-GMRES with various multi-block preconditioners on the test problem from [11]. We also report results for P-GMRES with approximate subdomain solvers, in combination with several strategies for the tolerance in the inner iterations. Our results suggest that a variable precision, decreasing during the course of the outer iterations, is most efficient. According to the performance model, however, exact solution will be cheapest for the relatively small subdomains (less than 22500 grid points) considered in the tests.


2 Mathematical background

2.1 Nonoverlapping domain decomposition

We consider an (elliptic) partial differential equation discretised using a finite difference or finite volume method on a computational domain $\Omega$. By a computational domain we mean a set of unknown values to be approximated, together with their locations in space. We suppose that $\Omega$ is the union of $M$ nonoverlapping subdomains $\Omega_m$, $m = 1, \ldots, M$.

Discretisation of the PDE results in a sparse linear system

$$ A x = b \tag{1} $$

with $x, b \in \mathbb{R}^N$. The structure of the matrix $A$ is determined by the stencil of the discretisation. Even if there is no overlap between the subdomains, there is an inter-subdomain coupling due to the stencil. Grouping together into blocks those unknowns which share a common subdomain will permute the system (1) to produce a block system:

$$
\begin{pmatrix} A_{11} & \cdots & A_{1M} \\ \vdots & \ddots & \vdots \\ A_{M1} & \cdots & A_{MM} \end{pmatrix}
\begin{pmatrix} x_1 \\ \vdots \\ x_M \end{pmatrix}
=
\begin{pmatrix} b_1 \\ \vdots \\ b_M \end{pmatrix}
\tag{2}
$$

Here, the diagonal blocks $A_{mm}$ express the coupling among unknowns defined on a common subdomain $\Omega_m$, whereas the off-diagonal blocks $A_{mn}$, $m \ne n$, represent coupling across subdomain boundaries. The only nonzero off-diagonal blocks are those corresponding to neighbouring subdomains. Moreover, we will assume in the sequel that a red-black colouring of the subdomains exists, such that adjacent subdomains have a different colour, i.e. there holds

$$ A_{mn} = 0, \quad m \ne n, \quad \text{if } \Omega_m \text{ and } \Omega_n \text{ have the same colour.} $$

This restriction, which often can be satisfied in practice, is essential for the solver P-GMRES.

The additive Schwarz iteration introduces the block Jacobi preconditioner

$$ K = \begin{pmatrix} A_{11} & & \\ & \ddots & \\ & & A_{MM} \end{pmatrix} $$

which, together with the residual $b - A x^{(i)}$, defines a system whose solution provides an approximation of the error $x - x^{(i)}$. Because this system decouples into $M$ independent systems, it can be solved efficiently on parallel computers.
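The decoupling of the block Jacobi system into independent subdomain solves can be illustrated with a small sketch. The example below is not taken from the report: the 1D Laplacian, the sizes `N` and `M`, and the helper name `block_jacobi_solve` are assumptions chosen only to keep the illustration short.

```python
import numpy as np

def block_jacobi_solve(A, blocks, q):
    """Solve K w = q with K = diag(A_11, ..., A_MM), one subdomain at a time."""
    w = np.zeros_like(q, dtype=float)
    for idx in blocks:                       # each solve is independent of the others
        w[idx] = np.linalg.solve(A[np.ix_(idx, idx)], q[idx])
    return w

# Hypothetical example: 1D Laplacian, N = 12 unknowns, M = 3 subdomains.
N, M = 12, 3
A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
blocks = np.array_split(np.arange(N), M)
b = np.ones(N)

# Unaccelerated additive Schwarz iteration: x <- x + K^{-1} (b - A x).
x = np.zeros(N)
for _ in range(300):
    x = x + block_jacobi_solve(A, blocks, b - A @ x)

residual = np.linalg.norm(b - A @ x)
```

The loop over `blocks` is the part that would be distributed over processors; the stationary iteration itself converges slowly, which is why a Krylov method is used to accelerate it.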

This form of domain decomposition has also been considered by Frank and Vuik [11]. For a thorough discussion of domain decomposition see the book [24] and the review article [7], and the extensive bibliography therein. Roughly speaking, the convergence rate suffers proportionally to the number of subdomains in each direction. The convergence rate may be made independent of the grid size by using constant overlaps or by application of coarse subspace correction. We will accelerate the convergence by a Krylov subspace method, as in [11].


Algorithm: GMRES

1. Start: Let initial guess $x_0$ be given.
   $q_0 = b - A x_0$, solve $K w = q_0$, $\beta = \|w\|_2$, $v_1 = w / \beta$.

2. Arnoldi process:
   for $k = 1, 2, \ldots$ until convergence:
   • $q_k = A v_k$
   • Solve $K w = q_k$
   • $[v_{k+1}, h_k] = \mathrm{orthonorm}(w, v_j, j \le k)$
   end

3. Form approximate solution:
   • Define $V_k = [v_1, v_2, \ldots, v_k]$
   • Define $H_k = [h_1, h_2, \ldots, h_k]$
   • Compute $x_k = x_0 + V_k y_k$, where $y_k = \arg\min_y \|\beta e_1 - H_k y\|_2$ and $e_1 = [1, 0, \ldots, 0]^T$.

4. Restart: Compute $A x_k$ and check termination criterion. If satisfied stop, else set $x_0 = x_k$ and go to 1.

Figure 1: The GMRES algorithm
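The steps of Fig. 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the report: the least-squares problem is solved with `np.linalg.lstsq` rather than Givens rotations, the function name `gmres_cycle`, the breakdown threshold and the small 2D test matrix are all hypothetical.

```python
import numpy as np

def gmres_cycle(A, b, K_solve, x0, k_max):
    """One restart cycle of left-preconditioned GMRES with MGS (cf. Fig. 1)."""
    w = K_solve(b - A @ x0)
    beta = np.linalg.norm(w)
    V = [w / beta]
    H = np.zeros((k_max + 1, k_max))
    for k in range(k_max):
        w = K_solve(A @ V[k])
        w_norm = np.linalg.norm(w)
        for j in range(k + 1):                   # modified Gram-Schmidt
            H[j, k] = w @ V[j]
            w = w - H[j, k] * V[j]
        H[k + 1, k] = np.linalg.norm(w)
        if H[k + 1, k] <= 1e-12 * w_norm:        # invariant subspace: stop early
            H = H[:k + 2, :k + 1]
            break
        V.append(w / H[k + 1, k])
    k_used = H.shape[1]
    rhs = np.zeros(k_used + 1)
    rhs[0] = beta                                # beta * e_1
    y = np.linalg.lstsq(H, rhs, rcond=None)[0]   # argmin || beta e_1 - H y ||
    return x0 + np.column_stack(V[:k_used]) @ y

# Hypothetical example: 5-point Laplacian on an 8 x 8 grid, 2 x 2 subdomains.
m, n = 8, 4
T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))
i, j = np.divmod(np.arange(m * m), m)
sub = (i // n) * (m // n) + (j // n)
K = np.where(sub[:, None] == sub[None, :], A, 0.0)   # block Jacobi matrix
K_solve = lambda q: np.linalg.solve(K, q)

b = np.random.default_rng(0).standard_normal(m * m)
x = np.zeros(m * m)
for cycle in range(20):                              # restarted GMRES(10)
    if np.linalg.norm(b - A @ x) <= 1e-10 * np.linalg.norm(b):
        break
    x = gmres_cycle(A, b, K_solve, x, k_max=10)
```

The termination test on the true residual before each cycle corresponds to step 4 of the figure.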

2.2 GMRES acceleration

In practice (2) is solved iteratively, using $K$ as a preconditioner for a Krylov subspace method, such as the conjugate gradient method for symmetric problems or the GMRES method [22] for nonsymmetric problems. In contrast with [11], where GCR is used, we consider GMRES, shown in Fig. 1, in order to facilitate the comparison with P-GMRES [9].

The function orthonorm() takes input vectors $w$, orthonormalises $w$ with respect to the $v_i$, $i \le k$, and returns the modified vector $v_{k+1}$. In serial computations, the modified Gram-Schmidt method (MGS), shown in Fig. 2, is usually employed for the orthonorm() function.

In parallel computations MGS has serious disadvantages, because the inner products require global communications, and therefore do not scale. Moreover, these inner products must be computed successively, and their number increases by one in every iteration step. Various alternatives have been proposed for MGS, e.g. orthogonalising a number of vectors simultaneously [8, 17, 18], Householder transformations [29, 11] or two-fold application of the classical Gram-Schmidt method (CGS) [16, 11]. However, these alternatives have some drawbacks, varying from loss of stability with respect to rounding errors [3] to an increase of the number of floating point operations. Also, most alternatives are not applicable when a preconditioner is used that varies in each iteration. Frank and Vuik [11] conclude from a comparison with MGS and Householder that re-orthogonalised CGS (see Fig. 3) is the most attractive method.

Algorithm: Modified Gram-Schmidt

$[v_{k+1}, h_k] = \mathrm{orthonorm}(w, v_j, j \le k)$
for $j = 1, 2, \ldots, k$
   $h_{k,j} = (w, v_j)$
   $w = w - h_{k,j} v_j$
end
$h_{k,k+1} = \|w\|_2$
$v_{k+1} = w / h_{k,k+1}$

Figure 2: The modified Gram-Schmidt algorithm
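The difference between the two orthogonalisation variants is easiest to see side by side. The sketch below is an illustration only: the function names are hypothetical, the basis is stored as the columns of a matrix `V`, and the re-orthogonalised classical variant computes the final norm directly instead of via the Pythagorean formula of Fig. 3.

```python
import numpy as np

def orthonorm_mgs(w, V):
    """Modified Gram-Schmidt: k inner products, one after the other (Fig. 2)."""
    h = np.zeros(V.shape[1] + 1)
    w = w.copy()
    for j in range(V.shape[1]):
        h[j] = w @ V[:, j]          # in parallel: one global reduction per j
        w -= h[j] * V[:, j]
    h[-1] = np.linalg.norm(w)
    return w / h[-1], h

def orthonorm_cgs2(w, V):
    """Classical Gram-Schmidt applied twice (re-orthogonalised CGS)."""
    h1 = V.T @ w                    # all k inner products in a single reduction
    w = w - V @ h1
    h2 = V.T @ w                    # second pass restores orthogonality
    w = w - V @ h2
    h = np.append(h1 + h2, np.linalg.norm(w))
    return w / h[-1], h

# Hypothetical test data: 5 orthonormal vectors of length 50 and a random w.
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((50, 5)))
w = rng.standard_normal(50)
v_mgs, h_mgs = orthonorm_mgs(w, V)
v_cgs2, h_cgs2 = orthonorm_cgs2(w, V)
```

In exact arithmetic both functions return the same vector and coefficients; they differ in the number of separate global reductions (k + 1 for MGS, a fixed small number for the two-pass variant) and in the amount of floating point work.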

2.3 P-GMRES acceleration

Dekker [9] has proposed a modification of GMRES, called Partitioned GMRES (P-GMRES), which is applicable to (2) if the subdomains can be partitioned into two groups, such that for each pair of different subdomains from the same group the corresponding blocks $A_{mn}$ and $A_{nm}$ are zero.

In [9] the trivial case $M = 2$ is considered, but this situation also occurs if a red-black colouring of the subdomains is possible where only adjacent subdomains lead to nonzero blocks. In the sequel we assume that such a colouring exists. Let the restriction to the red subdomains be denoted by $R_r$, and to the black ones by $R_b$. Then the following equations, which are essential for P-GMRES, hold:

$$ R_r + R_b = I, \qquad R_r K^{-1} A R_r x = R_r x, \qquad R_b K^{-1} A R_b x = R_b x. $$
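These identities are easy to check numerically. The following sketch uses a hypothetical 1D example (16 unknowns, four subdomains coloured red, black, red, black); the matrix and all variable names are assumptions made for the illustration.

```python
import numpy as np

N, size = 16, 4
A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
sub = np.arange(N) // size                      # subdomain index of each unknown
red = (sub % 2 == 0)                            # colours alternate: R B R B

K = np.where(sub[:, None] == sub[None, :], A, 0.0)   # block Jacobi matrix
Rr = np.diag(red.astype(float))                 # restriction to red unknowns
Rb = np.eye(N) - Rr                             # restriction to black unknowns

x = np.random.default_rng(0).standard_normal(N)
Kinv_A = np.linalg.solve(K, A)                  # K^{-1} A
lhs_r = Rr @ Kinv_A @ (Rr @ x)                  # should equal Rr x
lhs_b = Rb @ Kinv_A @ (Rb @ x)                  # should equal Rb x
```

The check works because $A R_r x$ differs from $K R_r x$ only in black unknowns, and $K^{-1}$ does not mix subdomains.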

P-GMRES, described in Fig. 4, offers several advantages when compared to GMRES, whereas the computational costs in an iteration are about the same. First, P-GMRES yields an optimal approximation in the affine space

$$ x_0 + \mathrm{Span}\{V_k^r, V_k^b\}, $$

which has a higher dimension than

$$ x_0 + \mathrm{Span}\{V_k\}, $$

in which subspace GMRES searches for an approximation. Also, there holds

$$ \mathrm{Span}\{V_k\} \subset \mathrm{Span}\{V_k^r, V_k^b\}, $$

so P-GMRES converges at least as fast as GMRES.

Algorithm: Classical Gram-Schmidt

$[v_{k+1}, h_k] = \mathrm{orthonorm}(w, v_j, j \le k)$
for $j = 1, 2, \ldots, k$
   $h_{k,j} = (w, v_j)$
end
$h_{k,k+1} = \sqrt{\|w\|_2^2 - \sum_{j=1}^{k} (h_{k,j})^2}$
$v_{k+1} = \bigl(w - \sum_{j=1}^{k} h_{k,j} v_j\bigr) / h_{k,k+1}$

Figure 3: The classical Gram-Schmidt algorithm

Secondly, we have two independent orthogonalisation processes, one for the variables from the red subdomains and one for the black subdomains. This property allows several possibilities for parallelisation. On a 2-processor machine we could perform two modified Gram-Schmidt algorithms with only a communication step after termination of MGS. When many processors are available, we could divide the processors into two groups, each taking care of one orthogonalisation process, thereby slightly reducing the amount of communication, as not all processors are involved in the computation of an inner product. In this case it might even be more attractive to overlap computation and communication [8] in a natural way: computation of a vector update and a local inner product for the red subdomains can be done simultaneously with the accumulation of the inner products for the black subdomains, and vice versa, when each processor is assigned to both a red and a black subdomain. Then, the costs of the global communication would be negligible, provided the subdomains are not too small.

Finally we note that the minimisation problem in P-GMRES can be cheaply solved using Givens rotations, just as in GMRES [22], because $H_k^r$ and $H_k^b$ are both Hessenberg matrices. Due to the larger size $2k \times 2k$ of the coefficient matrix each iteration step requires $8k$ Givens rotations, compared to only $k$ rotations in a GMRES iteration. This additional amount of operations, however, is usually very small compared to the costs of the inner products.

Further, the computational work in the Arnoldi process of P-GMRES is just the same as in GMRES. The vectors $v_k^r$ and $v_k^b$ are both restricted to part of the subdomains, so the two matrix multiplications with $A$ amount to just one multiplication of $A$ with a full vector. In the preconditioning we need to solve

$$ K w^b = q_k^b, \qquad K w^r = q_k^r $$

only for the black and the red subdomains, respectively.


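Algorithm: P-GMRES

1. Start: Let initial guess $x_0$ be given.
   $q_0^r = R_r (b - A x_0)$, solve $K w^r = q_0^r$, $\beta_r = \|w^r\|_2$, $v_1^r = w^r / \beta_r$,
   $q_0^b = R_b (b - A x_0)$, solve $K w^b = q_0^b$, $\beta_b = \|w^b\|_2$, $v_1^b = w^b / \beta_b$.

2. Arnoldi process:
   for $k = 1, 2, \ldots$ until convergence:
   • $q_k^b = R_b A v_k^r$, $q_k^r = R_r A v_k^b$
   • Solve $K w^b = q_k^b$, $K w^r = q_k^r$
   • $[v_{k+1}^r, h_k^r] = \mathrm{orthonorm}(w^r, v_j^r, j \le k)$, $[v_{k+1}^b, h_k^b] = \mathrm{orthonorm}(w^b, v_j^b, j \le k)$
   end

3. Form approximate solution:
   • Define $V_k^r = [v_1^r, v_2^r, \ldots, v_k^r]$, $V_k^b = [v_1^b, v_2^b, \ldots, v_k^b]$
   • Define $H_k^r = [h_1^r, h_2^r, \ldots, h_k^r]$, $H_k^b = [h_1^b, h_2^b, \ldots, h_k^b]$
   • Compute $x_k = x_0 + V_k^r y_k^r + V_k^b y_k^b$, where $(y_k^r, y_k^b)$ minimises

   $$ \left\| \begin{pmatrix} \beta_r e_1 - I_k y^r - H_k^r y^b \\ \beta_b e_1 - I_k y^b - H_k^b y^r \end{pmatrix} \right\|_2 , $$

   $e_1 = [1, 0, \ldots, 0]^T \in \mathbb{R}^{k+1}$ and $I_k$ is the identity matrix, extended with a row of zeros.

4. Restart: Compute $A x_k$ and check termination criterion. If satisfied stop, else set $x_0 = x_k$ and go to 1.

Figure 4: The P-GMRES algorithm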
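A compact sketch of one P-GMRES cycle along the lines of Fig. 4 is given below. It is an illustration under several assumptions, not the author's code: the subdomain systems are solved exactly through a dense block diagonal matrix `K`, the coupled least-squares problem is handed to `np.linalg.lstsq` instead of Givens rotations, no breakdown handling is included, and the names `mgs`, `pgmres_cycle` and the 8 x 8 test grid are hypothetical.

```python
import numpy as np

def mgs(w, V):
    """Orthonormalise w against the vectors in the list V (modified Gram-Schmidt)."""
    h = np.zeros(len(V) + 1)
    for j, v in enumerate(V):
        h[j] = w @ v
        w = w - h[j] * v
    h[-1] = np.linalg.norm(w)
    return w / h[-1], h

def pgmres_cycle(A, b, K, red, x0, k):
    """k steps of P-GMRES; returns the iterate and the estimated residual norm."""
    solve = lambda q: np.linalg.solve(K, q)
    r0 = b - A @ x0
    wr = solve(np.where(red, r0, 0.0))               # K w^r = R_r (b - A x0)
    wb = solve(np.where(red, 0.0, r0))               # K w^b = R_b (b - A x0)
    beta_r, beta_b = np.linalg.norm(wr), np.linalg.norm(wb)
    Vr, Vb = [wr / beta_r], [wb / beta_b]
    Hr, Hb = np.zeros((k + 1, k)), np.zeros((k + 1, k))
    for i in range(k):
        wb = solve(np.where(red, 0.0, A @ Vr[i]))    # q^b = R_b A v^r
        wr = solve(np.where(red, A @ Vb[i], 0.0))    # q^r = R_r A v^b
        vr, hr = mgs(wr, Vr)
        vb, hb = mgs(wb, Vb)
        Hr[:i + 2, i], Hb[:i + 2, i] = hr, hb
        Vr.append(vr)
        Vb.append(vb)
    Ik = np.eye(k + 1, k)                            # identity with a zero row appended
    G = np.block([[Ik, Hr], [Hb, Ik]])               # acts on the stacked (y^r, y^b)
    g = np.zeros(2 * k + 2)
    g[0], g[k + 1] = beta_r, beta_b
    y = np.linalg.lstsq(G, g, rcond=None)[0]
    x = x0 + np.column_stack(Vr[:k]) @ y[:k] + np.column_stack(Vb[:k]) @ y[k:]
    return x, np.linalg.norm(G @ y - g)

# Hypothetical example: 5-point Laplacian, 8 x 8 grid, 2 x 2 checkerboard subdomains.
m, n = 8, 4
T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))
i, j = np.divmod(np.arange(m * m), m)
sub = (i // n) * (m // n) + (j // n)
red = ((i // n + j // n) % 2 == 0)
K = np.where(sub[:, None] == sub[None, :], A, 0.0)
b = np.random.default_rng(1).standard_normal(m * m)

x, est = pgmres_cycle(A, b, K, red, np.zeros(m * m), k=6)
true = np.linalg.norm(np.linalg.solve(K, b - A @ x))  # preconditioned residual norm
```

Because the red and black basis vectors have disjoint support, the norm of the small least-squares residual (`est`) coincides with the norm of the preconditioned residual (`true`), which is what makes the cheap termination test in step 4 possible.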


2.4 Subdomain solution

Solution for $w$ from the preconditioning equation $K w = q$ in the GMRES algorithm requires the solution of $M$ independent subdomain systems $A_{mm} w_m = q_m$, $m = 1, \ldots, M$, and similarly we have to solve $M$ subdomain systems to obtain the solutions $w^r$ and $w^b$ from the equations $K w^r = q^r$, $K w^b = q^b$ in the P-GMRES algorithm. In the formulations of the algorithms we have assumed that these subdomains are solved exactly. However, in practical problems this might be too expensive, especially when the subsystems are large, so that solution by an iterative method might be a better alternative. It is generally thought that the solution obtained should be very accurate [6, 4], otherwise GMRES acceleration may no longer be applied. In case of inaccurate subdomain solutions the methods GCR [26] and FGMRES [21], which also allow variable preconditioners, are more appropriate. However, these methods require more storage and, in case of GCR, additional computations. Moreover, P-GMRES is not suitable for very inaccurate subdomain solutions. Therefore, we require that the subdomain problems are solved with a moderate accuracy, and then restarted versions of GMRES are also applicable, as the following analysis shows. For results obtained with GCR and an inaccurate subdomain solution we refer to [11].

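Suppose that, instead of solving $K w_k = q_k$ exactly, we obtain an approximate solution $\tilde{w}_k$, satisfying

$$ K \tilde{w}_k = q_k + K \varepsilon_k, \qquad q_k = A v_k, $$

where

$$ \varepsilon_k = \tilde{w}_k - w_k. $$

Define

$$ E_k = [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_k] $$

and let $\tilde{w}_k$, $k = 0, 1, \ldots$, be used to generate the Krylov subspace. Then, there holds

$$ K^{-1} A V_k = V_{k+1} H_k - E_k. $$

Consequently, after $m$ outer iterations, using the inexact subdomain solutions, we obtain for the preconditioned residual

$$
\begin{aligned}
\|K^{-1}(A x_m - b)\| &= \|K^{-1}(A x_0 - b) + K^{-1} A V_m y_m\| \\
&= \|V_{m+1} H_m y_m - E_m y_m - K^{-1} q_0\| \\
&= \|V_{m+1} H_m y_m - E_m y_m - \beta V_{m+1} e_1 + \varepsilon_0\| \\
&\le \|V_{m+1} H_m y_m - \beta V_{m+1} e_1\| + \|\varepsilon_0 - E_m y_m\| \\
&\le \|H_m y_m - \beta e_1\| + \|\varepsilon_0\| + \sum_{k=1}^{m} \|\varepsilon_k\| \, |y_{m,k}|.
\end{aligned}
\tag{3}
$$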

As $\|y_m\| = \|x_m - x_0\|$ will be bounded, we observe that the inexact subdomain solutions do not affect the preconditioned residual dramatically, as long as the errors $\varepsilon_k$, $k = 0, 1, \ldots, m$, are small compared to the estimation of the preconditioned residual $\|H_m y_m - \beta e_1\|$, which is calculated in GMRES. In the experiments we will investigate the influence of the accuracy in the solution of the subdomain problems on the convergence of the outer iterations of GMRES and P-GMRES.
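The perturbed Arnoldi relation and the resulting bound (3) can be verified on a small example. The sketch below is hypothetical in every detail except the relations it tests: the inexact subdomain solve is mimicked by adding a random error of relative size `delta` to an exact solve, and the 1D matrix and sizes are chosen only for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
N, size, m = 24, 6, 5
A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
sub = np.arange(N) // size
K = np.where(sub[:, None] == sub[None, :], A, 0.0)
b = rng.standard_normal(N)

def inexact_solve(q, delta=1e-3):
    """Exact solve of K w = q plus an artificial error eps (an assumption)."""
    w = np.linalg.solve(K, q)
    eps = delta * np.linalg.norm(w) * rng.standard_normal(N) / np.sqrt(N)
    return w + eps, eps

w, eps0 = inexact_solve(b)                  # x0 = 0, hence q0 = b
beta = np.linalg.norm(w)
V = [w / beta]
H = np.zeros((m + 1, m))
E = np.zeros((N, m))
for k in range(m):
    w, E[:, k] = inexact_solve(A @ V[k])
    for j in range(k + 1):                  # modified Gram-Schmidt
        H[j, k] = w @ V[j]
        w = w - H[j, k] * V[j]
    H[k + 1, k] = np.linalg.norm(w)
    V.append(w / H[k + 1, k])
V = np.column_stack(V)

e1 = np.zeros(m + 1)
e1[0] = beta
y = np.linalg.lstsq(H, e1, rcond=None)[0]
x = V[:, :m] @ y
true_res = np.linalg.norm(np.linalg.solve(K, A @ x - b))
bound = (np.linalg.norm(H @ y - e1) + np.linalg.norm(eps0)
         + np.sum(np.linalg.norm(E, axis=0) * np.abs(y)))
```

Here `np.linalg.norm(H @ y - e1)` is the residual estimate that GMRES computes, and `bound` adds the error terms of (3) to it.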


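| Method | Inner products | Vector updates |
|---|---|---|
| GMRES (MGS) | $(k+1)$ SIP(1) | $k$ axpy |
| GMRES (CGS2) | 2 SIP($k+1$) | $2k$ axpy |
| P-GMRES (MGS) | $(k+1)$ HSIP(2) | $k$ haxpy(2) |

Table 1: Number of operations in the $k$-th iteration of GMRES and P-GMRES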

A second question which arises concerns the value of the tolerance to which the subdomain iterations should converge. There seems to be some theoretical evidence [25] that a fixed relative tolerance is optimal. However, in such a model it is usually assumed that convergence is linear and independent of previous iterations. This assumption is obviously not valid for (P-)GMRES, where the convergence is superlinear and the iterations are intimately related. Moreover, even if a fixed tolerance would be optimal, this value is not known beforehand. If the outer iteration converges slowly, too strict a tolerance will not be necessary in a restarted method. Also, when the outer iteration has almost converged, the accuracy can be relaxed, as the components $y_{m,k}$ will get small for increasing $k$, according to (extending $y_{k-1} \in \mathbb{R}^{k-1}$ with additional zeros)

$$ |y_{m,k}| = |y_{m,k} - y_{k-1,k}| \le \|y_m - y_{k-1}\| \le \|x_m - x_{k-1}\|. $$

3 Performance models

To give insight into the costs of the orthogonalisation procedure in GMRES and P-GMRES, and of the subdomain solution, we consider simple performance models. In the first subsection we discuss the orthogonalisation. The emphasis will then be on the communication costs for parallel platforms, as inner products distributed over the processors are to be calculated. In the second subsection we consider the sequential costs of subdomain solvers, as it is assumed that each processor takes care of the solution for one (or more) subdomains.

3.1 Orthogonalisation

The cost of orthogonalisation is mainly determined by the inner products and the vector updates which occur both in the MGS and the CGS algorithm. Here, we distinguish between inner products that can be computed simultaneously (i.e., with a single communication), those that cannot, and inner products for half the vector length, as occur in P-GMRES. Following [11], we denote $k$ simultaneous inner products by SIP($k$). Two inner products and vector updates of half the vector length are denoted by HSIP(2) and haxpy(2). Then, the modified and re-orthogonalised Gram-Schmidt can be broken down into components as given in Table 1.

Let the time for communication of a message of $n$ floating point numbers be given by

$$ t_{comm} = t_0 + \beta n, $$

where $t_0$ is the communication startup time, and $\beta$ is the time per floating point number, depending on the bandwidth. Let the time for $n$ floating point operations be given by

$$ t_{comp} = \varphi n. $$

Such a model is used in e.g. [15, 8, 11].

Let $p$ denote the number of processes, and define a function $f(p)$ which gives the maximum number of non-simultaneous sends necessary for a broadcast to $p - 1$ processes. The function $f(p)$ is machine dependent and also depends on the distribution of the processes on the machine. Common values are $f(p) = \log_2 p$ for a hypercube structure, and $f(p) = p - 1$ for an Ethernet broadcast. Assuming that each processor is responsible for an $n \times n$ subdomain with $n^2$ unknowns, we arrive at the times for the basic operations as in Table 2.

| Operation | Communication | Computation | Definition |
|---|---|---|---|
| send($k$) | $t_0 + \beta k$ | | Send a message of length $k$ |
| flop($n$) | | $n\varphi$ | $n$ floating point operations |
| B($p, k$) | $f(p)(t_0 + \beta k)$ | | Broadcast $k$ elements |
| SIP($k$) | $2 f(p)(t_0 + \beta k)$ | $2 k n^2 \varphi$ | $k$ simultaneous inner products |
| HSIP(2) | $2 f(p/2)(t_0 + \beta)$ | $2 n^2 \varphi$ | 2 simultaneous half IPs |
| axpy | | $2 n^2 \varphi$ | vector update |
| haxpy(2) | | $2 n^2 \varphi$ | 2 half vector updates |

Table 2: Communication and computation times

| Operation | Communication | Computation | Definition |
|---|---|---|---|
| HLIP | | $n^2 \varphi$ | Half local inner product |
| ALIP | $2 f(p)(t_0 + \beta)$ | | Accumulation of HLIPs |
| haxpy | | $n^2 \varphi$ | Half vector update |

Table 3: Communication and computation times

Based on the communication model outlined in Tables 1 and 2 the orthogonalisation time required for $s$ iterations of GMRES (without restart) using modified Gram-Schmidt (MGS), re-orthogonalised classical Gram-Schmidt (CGS2) and P-GMRES using modified Gram-Schmidt (P-MGS), is given by

$$ t_{MGS} = s(s+3)\,[\,2 n^2 \varphi + f(p)(t_0 + \beta)\,] - 2 s n^2 \varphi \tag{4} $$

$$ t_{CGS2} = s(s+3)\,[\,4 n^2 \varphi + 2 f(p) \beta\,] + 4 (s+1) f(p)\, t_0 - 4 s n^2 \varphi \tag{5} $$

$$ t_{P\text{-}MGS} = s(s+3)\,[\,2 n^2 \varphi + f(p/2)(t_0 + \beta)\,] - 2 s n^2 \varphi \tag{6} $$

Comparing these expressions, we see that the orthogonalisation in P-GMRES is slightly cheaper than MGS in GMRES, unless 2 processors are used, in which case P-GMRES requires no communication. On many processors, with high communication startup time, CGS2 seems to be favourable. However, if it is possible to overlap computation and communication, an implementation of P-GMRES might be considered where each processor is assigned both a red and a black subdomain, each containing $n^2/2$ unknowns. Then, the MGS algorithm for the red and black subdomains can be performed as in Fig. 5 (cf. [8]).


Algorithm: Modified implementation of MGS in P-GMRES

for $i = 1, 2, \ldots, k+1$
   • Accumulate local inner products $i - 1$ for black subdomains, if $i > 1$
     Update vectors on red subdomains, if $i > 1$
     Compute local inner products $i$ for red subdomains
   • Accumulate local inner products $i$ for red subdomains
     Update vectors on black subdomains, if $i > 1$
     Compute local inner products $i$ for black subdomains
end
• Accumulate local inner products $k + 1$ for black subdomains

Figure 5: Modified implementation of MGS

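Using the costs of each basic operation as given in Table 3, we arrive at the orthogonalisation time for this modification (P-MGSM)

$$ t_{P\text{-}MGSM} = s(s+1) \max\bigl(2 n^2 \varphi,\; 2 f(p)(t_0 + \beta)\bigr) + s\,\bigl[\,n^2 \varphi + 2 f(p)(t_0 + \beta) + \max\bigl(n^2 \varphi,\; 2 f(p)(t_0 + \beta)\bigr)\bigr] \tag{7} $$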
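The cost formulas (4), (6) and (7) are simple enough to evaluate directly. The sketch below is an illustration, not the script used for the report: the function names are hypothetical, CGS2 is left out, and the parameter values are the HP-cluster figures that the report takes from [11], together with the Ethernet broadcast model $f(p) = p - 1$.

```python
import math

def t_mgs(s, n, p, f, t0, beta, phi):
    """Formula (4): MGS in GMRES."""
    return s * (s + 3) * (2 * n**2 * phi + f(p) * (t0 + beta)) - 2 * s * n**2 * phi

def t_pmgs(s, n, p, f, t0, beta, phi):
    """Formula (6): MGS in P-GMRES, inner products within half the processors."""
    return s * (s + 3) * (2 * n**2 * phi + f(p / 2) * (t0 + beta)) - 2 * s * n**2 * phi

def t_pmgsm(s, n, p, f, t0, beta, phi):
    """Formula (7): overlapped variant of Fig. 5."""
    comm = 2 * f(p) * (t0 + beta)            # ALIP
    half = n**2 * phi                        # HLIP or haxpy
    return s * (s + 1) * max(2 * half, comm) + s * (half + comm + max(half, comm))

def speedups(s, n, p, f, params):
    """Predicted speedups of P-MGS and P-MGSM relative to MGS in GMRES."""
    base = t_mgs(s, n, p, f, **params)
    return base / t_pmgs(s, n, p, f, **params), base / t_pmgsm(s, n, p, f, **params)

hp = dict(t0=4.7e-4, beta=7.5e-6, phi=4.9e-8)   # HP workstation cluster
f_hp = lambda p: p - 1                          # Ethernet broadcast

S_pmgs_small, S_pmgsm_small = speedups(60, 20, 4, f_hp, hp)        # 400 unknowns
S_pmgs_balanced, S_pmgsm_balanced = speedups(60, 171, 4, f_hp, hp) # near balance
```

With these inputs the overlapped variant only pays off near the point where computation and communication per step are of the same size; for small subdomains the communication term dominates and the plain P-MGS ordering is predicted to be the better of the two.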

From [11] we derive representative values for the parameters $t_0$, $\beta$ and $\varphi$:

$$ t_0 \approx 4.7 \times 10^{-4}, \qquad \beta \approx 7.5 \times 10^{-6}, \qquad \varphi \approx 4.9 \times 10^{-8} $$

for a cluster of HP workstations, and for the Cray T3E using MPI communications

$$ t_0 \approx 2.4 \times 10^{-5}, \qquad \beta \approx 5.4 \times 10^{-8}, \qquad \varphi \approx 5.8 \times 10^{-8}. $$

Assuming the models (4)-(6) and (7), and $f(p) = p - 1$ for the HP cluster, $f(p) = \lceil \log_2 p \rceil$ for the Cray T3E, we compute the quantities

$$ S_{CGS2} = t_{MGS} / t_{CGS2}, \qquad S_{PMGS} = t_{MGS} / t_{P\text{-}MGS}, \qquad S_{PMGSM} = t_{MGS} / t_{P\text{-}MGSM}, $$

denoting the predicted speedup with respect to modified Gram-Schmidt in GMRES.

In Fig. 6 the results are plotted as function of $n$ for $s = 60$ and $p = 4, 9$ (HP cluster), resp. $p = 4, 25$ (Cray T3E). We observe that the model predicts that CGS2 is advantageous for small subdomain sizes, when communication is relatively expensive, as on the HP cluster. This observation has been made before in [11], where CGS2 and MGS are compared for the GCR method and the predictions verified in actual experiments. More importantly, the model predicts that modified Gram-Schmidt is more efficient in P-GMRES than in GMRES, due to the reduced communication costs, both on the HP cluster and on the Cray T3E. The speedup varies between 2.8 for small subdomains (400 unknowns per subdomain) with expensive communication (HP cluster with $p = 9$) and 1.0 for large subdomains (10000 unknowns) on the Cray T3E. Moreover, an additional speedup can be obtained by the modified version of P-GMRES, when communication and computation are balanced, i.e.

$$ 2 n^2 \varphi \approx 2 f(p)(t_0 + \beta). $$

[Figure 6, two panels: HP cluster (curves for p = 4 and p = 9) and Cray T3E (curves for p = 4 and p = 25). Horizontal axis: subdomain grid size n, from 20 to 100. Vertical axis: f_CGS2, f_PMGS, f_PMGSM. Solid lines: CGS2; dashed lines: PMGS; dotted lines: PMGSM.]

Figure 6: Predicted speedup for P-GMRES compared to GMRES

3.2 Costs of subdomain solution

We assume that each subdomain is solved on one processor, so no communication between processors is necessary and we can restrict ourselves to the sequential costs, determined by the number of floating point operations.

First, we consider the exact solution of the subdomain problem using a forward-backward substitution, after an LU-decomposition has been made. Let $n^2$ be the number of unknowns in the subdomain, and $n$ half the bandwidth of the coefficient matrix. Then the average cost of a subdomain solution is approximately

$$ C_{exact} = (4 n^3 + 2 n^4 / s)\, \varphi \tag{8} $$

where $s$ denotes the number of outer iterations in (P-)GMRES necessary for convergence of the global problem, and $\varphi$ the costs of one floating point operation.

Secondly, suppose that the subproblem is solved inexactly by $m$ iterations of GMRES using an ILUD-preconditioner ([19]) or an RILUD-preconditioner ([1]). For the two-dimensional problem considered in this paper the number of nonzeros in a row of the coefficient matrix will be 5, and hence 3 for the incomplete L and U factors. Then, the costs for the matrix-vector multiplication and the preconditioner are

$$ C_{matvec} = 9 n^2 \varphi, \qquad C_{prec} = 10 n^2 \varphi. $$

Note that these costs might be slightly reduced using Eisenstat's trick [10], but that is not essential here. In Table 4 we list the costs and the number of basic operations in $m$ inner iterations.

| Operation | Number | Costs | Definition |
|---|---|---|---|
| matvec | $m$ | $9 n^2 \varphi$ | Matrix-vector multiply |
| prec | $m + 1$ | $10 n^2 \varphi$ | ILU-preconditioning |
| IP | $\tfrac{1}{2}(m+1)(m+2)$ | $2 n^2 \varphi$ | Inner product |
| axpy | $\tfrac{1}{2} m (m+1)$ | $2 n^2 \varphi$ | Vector update |
| scal | $m + 1$ | $n^2 \varphi$ | Vector scaling |
| sol | 1 | $2 m n^2 \varphi$ | Solution update |

Table 4: Number of operations and costs for $m$ inner iterations

Neglecting the costs of the construction of the preconditioner (approx. $10 n^2 \varphi$) and the solution of the Hessenberg system (about $m^2 \varphi$), we obtain for the costs of the inexact solution

$$ C_{gmres}(m) = (2 m^2 + 26 m + 13)\, n^2 \varphi \tag{9} $$

Comparison of (8) and (9) shows that for this type of problems solution by GMRES is only competitive if $2\sqrt{n}$ iterations are sufficient to obtain a reasonably accurate subdomain solution. Hence, the number of unknowns in the subdomain, $n^2$, should be quite large, e.g. $n > 100$, but in that case the inner iteration will probably converge slowly and 20 iterations might not be sufficient. Alternatively, the subdomain problem might be solved very inaccurately (cf. [11]), leading to a small value of $m$. However, the number of outer iterations will increase then, and the exact solution might be the most efficient one after all. We will pursue this issue further in the numerical experiments. Here, we conclude that, for the 2D problem considered, exact subdomain solution will probably be computationally the most efficient. Only when memory limitations exclude the use of a full LU-decomposition will subdomain solution by an iterative method be of value.
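The comparison of (8) and (9) can be made concrete with a few lines of code. The sketch below is only an illustration: the function names are hypothetical, costs are expressed in units of $\varphi$, and the value $s = 30$ in the example is an assumption (it equals the restart value used in the experiments, not a measured iteration count).

```python
def c_exact(n, s, phi=1.0):
    """Formula (8): banded LU amortised over s outer iterations, plus substitution."""
    return (4 * n**3 + 2 * n**4 / s) * phi

def c_gmres(m, n, phi=1.0):
    """Formula (9): m inner iterations of ILU-preconditioned GMRES."""
    return (2 * m**2 + 26 * m + 13) * n**2 * phi

def break_even(n, s):
    """Largest m for which m inner iterations cost no more than an exact solve."""
    m = 0
    while c_gmres(m + 1, n) <= c_exact(n, s):
        m += 1
    return m

m_max = break_even(n=100, s=30)     # subdomain with 10000 unknowns
```

For $n = 100$ and $s = 30$ the inner iteration has to reach the required accuracy within 17 steps to match the cost of the exact solve, which is of the same order as the $2\sqrt{n} = 20$ quoted in the text.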

4 Numerical experiments

In this section we give numerical results which provide insight into the convergence behaviour of P-GMRES in comparison with GMRES. We also assess the performance of several subdomain solvers and evaluate the merits of various stopping criteria in the inner iteration and restart strategies in the outer iteration. All experiments were performed on an HP-755 workstation, and we will report on the number of floating point operations as a measure of the efficiency of the sequential subdomain solvers. The parallel performance of the algorithms is predicted in Section 3.1, and will be the subject of evaluation in a forthcoming study.

| $M$ | GMRES ($\beta = 0$) | P-GMRES ($\beta = 0$) | GMRES ($\beta = 1$) | P-GMRES ($\beta = 1$) |
|---|---|---|---|---|
| 2 | 12 | 12 | 20 | 16 |
| 3 | 19 | 14 | 28 | 23 |
| 4 | 24 | 24 | 36 | 29 |
| 5 | 25 | 21 | 39 | 34 |
| 6 | 29 | 29 | 45 | 40 |

Table 5: Number of iterations for the $60 \times 60$ grid with $M \times M$ subdomains

As a test example, we consider a Poisson problem, discretised with a finite difference method on a square domain. Such a problem is similar to the pressure correction matrix, which is solved in each time step of an incompressible Navier-Stokes simulation to enforce the divergence-free constraint [27], apart from some asymmetry in the pressure correction matrix. As the test example is meant to model such a pressure correction matrix, we do not exploit the symmetry in the experiments. The domain is decomposed into $M \times M$ subdomains, each containing $n \times n$ grid points. With $h = \Delta x = \Delta y = 1/(Mn + 1)$ the discretisation is

$$ 4 u_{i,j} - u_{i,j-1} - u_{i,j+1} - u_{i-1,j} - u_{i+1,j} = h^2 f_{i,j} \tag{10} $$

The right-hand side function is $f_{i,j} = f(ih, jh)$, where

f x y x 1 x 3 2βx 1 3y 324 , y 1 y 3 2βy 1 3x 324 (11)

Homogeneous Dirichlet boundary conditions $u = 0$ are defined on $\partial\Omega$. Note that this example is almost identical to the one in [11], apart from a different discretisation at the boundaries. Moreover, we introduced an additional term in the right-hand side, as the formulation in [11], with $\beta = 0$, suffers from an 8-fold symmetry which usually speeds up the convergence considerably.
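For readers who want to experiment, the matrix of the discretisation (10) and the checkerboard colouring of the $M \times M$ subdomains can be assembled as sketched below. Everything here is an assumption made for illustration: dense NumPy storage, the helper name `poisson_blocks`, the small sizes, and in particular the constant right-hand side, which is a placeholder and not the function (11) of the report.

```python
import numpy as np

def poisson_blocks(M, n):
    """Matrix of (10) on an (Mn) x (Mn) grid, with subdomain and colour maps."""
    m = M * n
    h = 1.0 / (m + 1)
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))   # 4 on the diagonal, -1 per neighbour
    i, j = np.divmod(np.arange(m * m), m)               # 0-based grid indices
    sub = (i // n) * M + (j // n)                       # subdomain number 0 .. M*M - 1
    red = ((i // n + j // n) % 2 == 0)                  # checkerboard colouring
    return A, h, sub, red, i, j

M, n = 3, 4
A, h, sub, red, i, j = poisson_blocks(M, n)

f = lambda x, y: np.ones_like(x)                        # placeholder right-hand side
rhs = h**2 * f((i + 1) * h, (j + 1) * h)                # h^2 f(ih, jh), 1-based i, j

# Largest coupling between different subdomains of the same colour.
same_colour = red[:, None] == red[None, :]
other_sub = sub[:, None] != sub[None, :]
coupling = np.abs(A[same_colour & other_sub]).max()
```

Because the 5-point stencil has no diagonal neighbours, `coupling` is zero, which is exactly the red-black condition $A_{mn} = 0$ required by P-GMRES.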

4.1 Convergence behaviour with exact subdomain solution

In this section we compare the speed of convergence of GMRES and P-GMRES. For all tests a fixed restart value of $s = 30$ was used, and the solution was computed once the initial (preconditioned) residual had been reduced by a factor of $10^6$. In all cases the subdomain problems were solved exactly.

In the first experiment we compare results for a fixed problem size on the $60 \times 60$ grid with $M \times M$ subdomains, $M = 2, \ldots, 6$, and two different values for $\beta$ in (11). In Table 5 we list the required number of iterations. It is interesting to observe that the symmetry in the problem ($\beta = 0$) influences the convergence substantially. Therefore, we think that it is more appropriate to consider the case with $\beta = 1$ as a model for practical problems. Also note that P-GMRES for $M$ even does not profit as much from the symmetry as GMRES does. This can be explained by the fact that the solutions on the black and red subdomains are identical in the symmetric case; consequently, P-GMRES and GMRES converge in exactly the same way. In all other cases P-GMRES requires about 20% fewer iterations than GMRES.

We considered grids of dimension 60, 120, 180, 240 and 300 in a second experiment. The number of required iterations for various M × M subdomain partitionings is plotted in Fig. 7. In all cases we took the problem with β = 1. One sees that P-GMRES converges faster than GMRES in all cases but one (n = 300, M = 4). A marked improvement is obtained for an odd number of subdomains and a relatively small grid. Here the difference between solutions on the red and the black subdomains is most pronounced, and P-GMRES profits from the discrimination between these two. For M even, there will be 2 black and 2 red subdomains in the corners of Ω, and the solutions on these two sets of subdomains will not behave very differently.

Figure 7: # Iterations for various subdomains (horizontal axis: domain size; vertical axis: # iterations; solid lines: GMRES, dashed lines: P-GMRES; markers: o M=2, x M=3, . M=4, + M=5, * M=6)
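The remark on the corner subdomains can be checked with a short sketch of a chessboard (red-black) colouring of the M × M subdomains. The helper names `colour` and `corner_colours`, and the choice to call subdomain (0, 0) "red", are assumptions made for this illustration only.

```python
def colour(p, q):
    """Chessboard colour of subdomain (p, q), with 0 <= p, q < M."""
    return "red" if (p + q) % 2 == 0 else "black"


def corner_colours(M):
    """Colours of the four corner subdomains of an M x M partitioning."""
    corners = [(0, 0), (0, M - 1), (M - 1, 0), (M - 1, M - 1)]
    return [colour(p, q) for p, q in corners]


for M in range(2, 7):
    cols = corner_colours(M)
    print(M, cols.count("red"), "red,", cols.count("black"), "black")
```

For even M the corners split into two red and two black subdomains, whereas for odd M all four corners carry the same colour, so the two colour classes differ more strongly in how they touch the boundary of Ω.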

In a last experiment we chose different starting values for the problem with β = 1 on the 60 × 60 grid, viz.

x 0i j WV max i . j 2 n 2 1n i * n j * n

ui j otherwise (12)

where $u_{i,j}$ denotes the exact solution of (10). Consequently, we start the iteration with the exact solution in all subdomains but the first one. Table 6 shows that GMRES does not profit from such a good starting value at all, but P-GMRES has converged after 1 iteration, as could be expected. This suggests that P-GMRES might be very efficient for problems whose solutions behave differently on various parts of the domain, such as layered problems [28].

        x^(0) = 0             x^(0) from (12)
M     GMRES   P-GMRES       GMRES   P-GMRES
2      20       16            25       1
3      28       23            35       1
4      36       29            48       1
5      39       34            51       1
6      45       40            48       1

Table 6: Number of iterations for the 60 × 60 grid with M × M subdomains

4.2 Evaluation of approximate subdomain solvers

In this section we compare the performance of a number of approximate subdomain solvers to get an impression of which solvers might be an alternative for exact subdomain solution in (P-)GMRES. Again, we used a fixed restart value of s = 30 in the outer iteration, unless noted otherwise, and a relative tolerance of $10^{-6}$.

The subdomain approximations will be denoted as follows:
• EX = exact subdomain solution,
• GMRk = (restarted) GMRES with a tolerance of $10^{-k}$, preconditioned with RILUD,
• GMRkF = (restarted) GMRES with a tolerance of $10^{-k}F$, preconditioned with RILUD.

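The last method needs some explanation. The factor F is given by

\[ F = \min\left\{ 10^{k-1},\ \max\left\{ 10^{k-7}\,\frac{\mathrm{res}_0}{\mathrm{res}_{j-1}},\ \frac{\mathrm{res}_{j_0}}{\mathrm{res}_{j-1}} \right\} \right\} \tag{13} \]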

where $\mathrm{res}_0$, $\mathrm{res}_{j_0}$ and $\mathrm{res}_{j-1}$ denote the norms of the initial residual, the residual at the beginning of the last restart and the residual after the previous outer iteration, respectively. This means that we aim to reduce the residual in a cycle of restarted (P-)GMRES by a factor of $10^{-k}$, unless the residual at the last restart is already small. Moreover, the tolerance for the subdomain approximation is bounded above by 0.1. Because the inaccuracies from the subdomain approximations will persist during a cycle of outer iterations, we do not continue the outer iteration until s = 30, but restart as soon as the condition

res j / 2 [ j j0 102 kres j0 is satisfied.
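To make the flexible tolerance concrete, the sketch below evaluates the subdomain tolerance $10^{-k}F$, assuming that (13) reads F = min{10^(k-1), max{10^(k-7) res_0/res_{j-1}, res_{j0}/res_{j-1}}}. The function name `subdomain_tolerance` and the sample residual values are hypothetical and serve only to show how the tolerance loosens during a cycle and is capped at 0.1.

```python
def subdomain_tolerance(k, res0, res_j0, res_prev):
    """Inner tolerance 10^(-k) * F for a GMRkF subdomain solver.

    res0     : norm of the initial residual
    res_j0   : norm of the residual at the beginning of the last restart
    res_prev : norm of the residual after the previous outer iteration
    (reading of (13) assumed as stated in the text above)
    """
    F = min(10.0 ** (k - 1),
            max(10.0 ** (k - 7) * res0 / res_prev, res_j0 / res_prev))
    return 10.0 ** (-k) * F


# Hypothetical residual history within one outer cycle, for k = 4.
tol_start = subdomain_tolerance(4, 1.0, 1.0, 1.0)    # start of a cycle
tol_later = subdomain_tolerance(4, 1.0, 1.0, 1e-2)   # residual reduced by 100
tol_cap = subdomain_tolerance(4, 1.0, 1.0, 1e-5)     # cap at 0.1 is active
print(tol_start, tol_later, tol_cap)
```

At the start of a cycle the tolerance equals $10^{-k}$; it grows in proportion to the reduction already achieved in the cycle and never exceeds 0.1.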


Solver    2 × 2       3 × 3       4 × 4       5 × 5       6 × 6
EX        16          23          29          34          40
GMR8      16 (19.9)   23 (17.4)   29 (15.3)   34 (13.8)   40 (12.8)
GMR6      17 (16.5)   26 (13.6)   31 (11.8)   36 (10.8)   40 (9.9)
GMR4      25 (11.8)   42 (9.7)    47 (8.1)    55 (7.3)    59 (6.9)
GMR7F     17 (11.1)   25 (8.9)    31 (8.0)    35 (7.4)    39 (7.0)
GMR4F     23 (7.1)    38 (5.9)    40 (5.4)    53 (4.7)    49 (4.5)

Table 7: Outer and inner iterations for various subdomain solvers, 60 × 60 grid

4.2.1 Convergence on a small problem

We applied P-GMRES for the fixed problem on the 60 × 60 grid with M × M subdomains. The required number of outer iterations and the averaged number of inner iterations (in parentheses) are listed in Table 7. For the sake of comparison we also include the iteration count for the exact subdomain solution.

It is seen that the outer iteration does not suffer much from an approximate subdomain solution, if a sufficiently small tolerance is imposed for the subdomain problems (GMR8, GMR6). However, the number of inner iterations and thus the amount of work is rather high in these cases, even for the small subdomains considered here. Relaxing the tolerance for the inner loop (GMR4) reduces the number of inner iterations, but leads to a significant increase in the number of outer iterations. The methods using a flexible inner loop tolerance perform much better. GMR7F requires roughly the same number of inner iterations per outer iteration as the inaccurate solver GMR4, without much loss of accuracy in the outer iteration, whereas GMR4F is about twice as cheap as GMR4, see (9).
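A rough way to compare the solvers is to multiply the number of outer iterations by the averaged number of inner iterations. The sketch below does this for the entries of Table 7, read as "outer (averaged inner)". This product is only a crude proxy chosen for this illustration; it is not the cost model (8), (9) used in the report, and the dictionary name `table7` is an assumption.

```python
# Entries of Table 7 as (outer iterations, averaged inner iterations),
# ordered by partitioning 2x2, 3x3, 4x4, 5x5, 6x6.
table7 = {
    "GMR8":  [(16, 19.9), (23, 17.4), (29, 15.3), (34, 13.8), (40, 12.8)],
    "GMR6":  [(17, 16.5), (26, 13.6), (31, 11.8), (36, 10.8), (40, 9.9)],
    "GMR4":  [(25, 11.8), (42, 9.7), (47, 8.1), (55, 7.3), (59, 6.9)],
    "GMR7F": [(17, 11.1), (25, 8.9), (31, 8.0), (35, 7.4), (39, 7.0)],
    "GMR4F": [(23, 7.1), (38, 5.9), (40, 5.4), (53, 4.7), (49, 4.5)],
}

# Crude work proxy: total number of inner iterations per subdomain.
total_inner = {solver: [outer * inner for outer, inner in row]
               for solver, row in table7.items()}

for solver, totals in total_inner.items():
    print(solver, [round(t, 1) for t in totals])
```

By this proxy GMR4F needs fewer inner iterations than GMR4 in every partitioning, in line with the observation that the flexible tolerance is considerably cheaper than a fixed tolerance.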

4.2.2 Convergence on a larger problem

In this subsection we consider the fixed problem on the 300 × 300 grid, which has also been used in [11] to assess the performance of subdomain solvers for GCR (although [11] has β = 0 in the right-hand side function (11)). We applied P-GMRES with the flexible subdomain solver GMRkF for various M × M partitionings of Ω, and list the number of outer and averaged inner iterations in Table 8. For the sake of comparison we also quote those numbers for GCR with GMR6 from [11].

Note that the flexible tolerance subdomain solvers perform quite satisfactorily, even for the rather crude tolerances in GMR4F. The difference in inner iterations between the fixed strategy in GCR and GMRkF is striking, whereas the convergence of the outer iteration is comparable. This can be explained by the fact that inaccuracies introduced by the subdomain solves are soon detected by the restarts in the outer loop, so they do not have a chance to spoil too many subsequent iterations. Fig. 8 illustrates the convergence of P-GMRES with the GMR4F subdomain solver for the 6 × 6 subdomain case.

Solver    2 × 2        3 × 3        4 × 4        5 × 5        6 × 6
EX        32           59           87           94           123
GCR       78 (68.4)    83 (38.7)    145 (31.4)   168 (26.4)   -
GMR6F     84 (14.4)    105 (13.0)   116 (12.9)   114 (12.8)   142 (12.0)
GMR5F     62 (17.3)    102 (12.2)   123 (11.8)   115 (12.2)   170 (10.6)
GMR4F     110 (12.1)   144 (10.8)   161 (10.0)   163 (10.1)   175 (10.7)

Table 8: Outer and inner iterations for various subdomain solvers, 300 × 300 grid

Figure 8: Convergence of P-GMRES (GMR4F), 300 × 300 grid, 6 × 6 subdomains (horizontal axis: iteration; vertical axis: calculated residual, logarithmic scale; solid line: EX, dashed line: GMR4F)

The peaks in the graph for GMR4F occur at the restarts, indicating that the calculated residual is contaminated by inaccuracies from the subdomain solve. However, subsequent iterations quickly diminish the value of these residuals. The number of inner iterations used in this example is plotted in Fig. 9. It is seen that the required number of inner iterations rapidly decreases in the course of an outer cycle, using the flexible tolerance. Consequently, the averaged number of inner iterations is substantially reduced.

In some experiments we observed that the behaviour of the residual was quite erratic in the final stage of convergence (see Fig. 8). Then, the calculated residual in the outer iteration is small enough, hence the cycle is terminated, but the actual residual, calculated after the update of the solution, does not yet satisfy the stopping criterion, so a new cycle of outer iterations is started. This phenomenon is probably due to the crude tolerance (tol up to 0.1) with which the subproblems are solved. Requiring tol ≤ 0.01 gives a much more regular behaviour of the calculated residual, however at the cost of more inner iterations.
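The interplay between the calculated and the actual residual can be sketched as a restart loop that accepts the solution only when the actual residual meets the stopping criterion. This is a control-flow illustration only: `solve_with_restarts`, the callback `run_cycle` and the scalar toy problem are hypothetical stand-ins, not an implementation of (P-)GMRES. The toy cycle always reports a calculated residual of zero while removing only 75% of the error, mimicking a cycle that terminates too optimistically.

```python
def solve_with_restarts(run_cycle, true_residual, x0, rel_tol, max_cycles=50):
    """Restart until the actual residual meets the stopping criterion.

    run_cycle(x, target) -> (x_new, calculated_residual) stands in for one
    cycle of a restarted outer iteration (hypothetical interface).
    """
    x = x0
    target = rel_tol * true_residual(x0)
    history = []
    for cycle in range(1, max_cycles + 1):
        x, calculated = run_cycle(x, target)
        actual = true_residual(x)          # recomputed after the update
        history.append((calculated, actual))
        if actual <= target:
            return x, cycle, history
    return x, max_cycles, history


# Toy problem a * x = b with an over-optimistic cycle.
a, b = 2.0, 1.0


def true_residual(x):
    return abs(b - a * x)


def sloppy_cycle(x, target):
    r = b - a * x
    return x + 0.75 * r / a, 0.0   # claims convergence, removes 75% of the error


x, cycles, history = solve_with_restarts(sloppy_cycle, true_residual, 0.0, 1e-3)
print(cycles, true_residual(x))
```

Each cycle in the toy example ends with a calculated residual below the target, yet several extra cycles are needed before the actual residual satisfies the criterion, which is the kind of behaviour described in the preceding paragraph.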

Finally we remark that in [11] GCR is also applied with the subdomain solvers GMR2, GMR1 and RILUD, the latter standing for just one forward-backward substitution with the incomplete factors from the RILUD decomposition. We quote their results for the 5 × 5 subdomain case in Table 9.

Figure 9: Inner iterations in P-GMRES (GMR4F), 300 × 300 grid, 6 × 6 subdomains (horizontal axis: outer iteration; vertical axis: average number of inner iterations)

GCR (GMR6)   GCR (GMR2)   GCR (GMR1)   GCR (RILUD)   P-GMRES (GMR4F)
168 (26.4)   192 (10.9)   303 (5.9)    437 (1)       163 (10.1)

Table 9: Outer and inner iterations for various subdomain solvers, 300 × 300 grid

They note that the inner loop tolerance of 0.1 is insufficient for fast global convergence, although it is still an expensive subdomain approximation. We did not apply these crude tolerances, because the flexible strategy GMR4F already invokes large inner loop tolerances during part of the outer iterations, see (13), and moreover (P-)GMRES is more sensitive to inaccurate subdomain solutions than GCR. We also did not apply RILUD as a preconditioner. Although GMRES can be combined with RILUD very well, P-GMRES will fail, as some accuracy in the solution of the subdomain problem is required for this method.

4.2.3 Performance for the larger problem

Fig. 10 shows the performance of the P-GMRES method with various subdomain solvers for the 300 × 300 grid. The figure shows the number of floating point operations per subdomain required for the convergence of the outer iteration. These numbers are computed from the costs of the outer iteration and (8), (9). Increasing the number of subdomains reduces the computational costs substantially, showing the potential of the domain decomposition method for parallelisation. Note that the exact subdomain solver is the most efficient one for the relatively small grids considered here.


Figure 10: Number of operations per subdomain, 300 × 300 grid (horizontal axis: # subdomains; vertical axis: # flops per subdomain; curves for EX, GMR4F, GMR5F and GMR6F)

5 Conclusions

For applications which require domain decomposition the partitioned iterative method P-GMRES offers advantages compared to GMRES, both with respect to speed of convergence and communication costs. These advantages are expected to be more pronounced for problems with large variations in the solution on the various subdomains, e.g. in layered problems ([28]). However, P-GMRES can only be applied if a red-black colouring of the subdomains is possible.

Although monotone convergence of (P-)GMRES is only guaranteed in case of exact subdomain solution, it is also possible to solve the subdomain problems approximately, in combination with restarts. Our experiments indicate that a large reduction in computation time is obtained by a flexible tolerance strategy in the subdomain problems, which contradicts theoretical suggestions in literature ([25]). A rather crude tolerance (tol up to 0.1) is allowed in the subdomain solutions, but a stricter tolerance leads to a smoother convergence. However, for the size of the problems we considered, up to 22500 unknowns per subdomain, an exact subdomain solver still turned out to be the most efficient one. Nevertheless, it seems worthwhile to modify P-GMRES according to the ideas in [21] to recover the monotone convergence. This would only cost additional storage, which is usually not a problem on a parallel system.

Considering the computational work only, the domain decomposition method profits from a large number of subdomains. Notwithstanding the slower convergence, the work per subdomain decreases. This allows good opportunities for parallelisation if the communication costs are small. A performance model indicates that the orthogonalisation in P-GMRES is about twice as cheap as in GMRES for relatively small problems, using 9 processors and up to 5000 unknowns per processor for the HP-cluster. The advantage is about 20% on large problems, e.g. 10000 unknowns per subdomain on the Cray T3E. The model also indicates that a speedup of 1.5 can be obtained if computation and communication are balanced, and can be overlapped. For very small problems it might be advantageous to use the re-orthogonalised classical Gram-Schmidt orthogonalisation, as suggested in [11].


References

[1] O. Axelsson and G. Lindskog. On the eigenvalue distribution of a class of preconditioning methods. Numer. Math., 48:479–498, 1986.

[2] Z. Bai, D. Hu, and L. Reichel. A Newton-basis GMRES implementation. IMA J. Numer. Anal., 14:563–581, 1994.

[3] A. Björck. Solving linear least squares problems by Gram-Schmidt orthogonalization. BIT, 7:1–21, 1967.

[4] C. Börgers. The Neumann-Dirichlet domain decomposition method with inexact solvers on the subdomain. Numer. Math., 55:123–136, 1989.

[5] E. Brakkee. Domain decomposition for the incompressible Navier-Stokes equations. PhD thesis, Delft University of Technology, Delft, The Netherlands, April 1996.

[6] E. Brakkee, C. Vuik, and P. Wesseling. Domain decomposition for the incompressible Navier-Stokes equations: Solving subdomain problems accurately and inaccurately. Int. J. Num. Meth. Fluids, 26:1217–1237, 1998.

[7] T.F. Chan and T.P. Mathew. Domain decomposition algorithms. In A. Iserles, editor, Acta Numerica, pages 61–143, Cambridge, 1994. Cambridge University Press.

[8] E. de Sturler and H.A. van der Vorst. Reducing the effect of global communication in GMRES(m) and CG on parallel distributed memory computers. Appl. Numer. Math., 18:441–459, 1995.

[9] K. Dekker. Parallel GMRES and domain decomposition. Report, Delft University of Technology, Delft, 2000.

[10] S.C. Eisenstat. Efficient implementation of a class of preconditioned conjugate gradient methods. SIAM J. Sci. Stat. Comput., 2:1–4, 1981.

[11] J. Frank and C. Vuik. Parallel implementation of a multiblock method with approximate subdomain solution. Appl. Numer. Math., 30:403–423, 1999.

[12] G. Haase, U. Langer, and A. Meyer. The approximate Dirichlet domain decomposition method, Part I: An algebraic approach. Computing, 47:137–151, 1991.

[13] G. Haase, U. Langer, and A. Meyer. The approximate Dirichlet domain decomposition method, Part II: Applications to 2nd-order elliptic BVPs. Computing, 47:153–167, 1991.

[14] G. Haase, U. Langer, and A. Meyer. Domain decomposition preconditioners with inexact subdomain solvers. J. Numer. Linear Algebra Appl., 1:27–41, 1991.

[15] R.W. Hockney and C.R. Jesshope. Parallel Computers 2: Architecture, Programming and Algorithms. Adam Hilger, Bristol, 1988.


[16] W. Hoffmann. Iterative algorithms for Gram-Schmidt orthogonalization. Computing, 41:335–348, 1989.

[17] W. Jalby and B. Philippe. Stability analysis and improvement of the block Gram-Schmidt algorithm. SIAM J. Sci. Stat. Comput., 12(5):1058–1073, 1991.

[18] G. Li. A block variant of the GMRES method on massively parallel processors. Parallel Computing, 23:1005–1019, 1997.

[19] J.A. Meijerink and H.A. van der Vorst. An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Math. Comp., 31:148–162, 1977.

[20] A. Meyer. A parallel preconditioned conjugate gradient method using domain decomposition and inexact solvers on each subdomain. Computing, 45:217–234, 1990.

[21] Y. Saad. A flexible inner-outer preconditioned GMRES algorithm. SIAM J. Sci. Stat. Comput., 14:461–469, 1993.

[22] Y. Saad and M.H. Schultz. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7:856–869, 1986.

[23] A. Segal, P. Wesseling, J. van Kan, C.W. Oosterlee, and C. Kassels. Invariant discretization of the incompressible Navier-Stokes equations in boundary fitted co-ordinates. Int. J. Num. Meth. Fluids, 15:411–426, 1992.

[24] B.F. Smith, P.E. Bjørstad, and W.D. Gropp. Domain Decomposition; Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, Cambridge, UK, 1996.

[25] K.H. Tan. Local coupling in domain decomposition. PhD thesis, Utrecht University, Utrecht, The Netherlands, April 1996.

[26] H.A. van der Vorst and C. Vuik. GMRESR: a family of nested GMRES methods. Num. Lin. Alg. Appl., 1:369–386, 1994.

[27] J.J.I.M. van Kan. A second-order accurate pressure correction method for viscous incompressible flow. SIAM J. Sci. Stat. Comput., 7:870–891, 1986.

[28] C. Vuik, A. Segal, and J.A. Meijerink. An efficient preconditioned CG method for the solution of a class of layered problems with extreme contrasts in the coefficients. J. Comp. Phys., 152:385–403, 1999.

[29] H.F. Walker. Implementation of the GMRES method using Householder transformations. SIAM J. Sci. Stat. Comput., 9:152–163, 1988.

[30] P. Wesseling, A. Segal, C.G.M. Kassels, and H. Bijl. Computing flows on general two-dimensional nonsmooth staggered grids. J. Eng. Math., 34:21–44, 1998.
