A parallel multifrontal algorithm and its implementation



ELSEVIER Comput. Methods Appl. Mech. Engrg. 149 (1997) 289-301

Computer methods in applied mechanics and engineering

A parallel multifrontal algorithm and its implementation

P. Geng, J.T. Oden*, R.A. van de Geijn

Texas Institute for Computational and Applied Mathematics, The University of Texas at Austin, Austin, TX 78712, USA

Abstract

In this paper, we describe a multifrontal method for solving sparse systems of linear equations arising in finite element and finite difference methods.

The method proposed in this study is a combination of the nested dissection ordering and the frontal method. It can significantly reduce the storage and computational time required by conventional direct methods, and it is also a natural parallel algorithm. In addition, the method inherits major advantages of the frontal method, which include a simple interface with finite element codes and an effective data structure, so that the entire computation is performed element by element on a series of small linear systems with dense stiffness matrices.

The numerical implementation targets both distributed-memory machines and conventional sequential machines. Its performance is tested through a series of examples.

1. Introduction

In this paper, we seek an efficient direct method for solving the system of linear equations

Ax = b    (1)

where A = [a_ij]_{N×N} is a sparse matrix generated by finite element or finite difference methods, b = {b_i}_N is a known vector, x = {x_i}_N is the unknown solution vector, and N is the total number of unknowns, also referred to as the total degrees of freedom.

Sparse systems of linear equations are typically solved by one of two different methods: iterative methods or direct methods. A direct method involves explicit factorization of the sparse matrix A into the product of lower and upper triangular matrices L and U, and it generally requires much more computer time and storage than iterative methods. However, direct methods are important because of their generality and robustness. In many cases, direct methods are preferred because the effort involved in seeking a good preconditioner for an iterative solution often outweighs the cost of direct factorization. Furthermore, direct methods provide an effective means for solving systems with the same stiffness matrix A and different right-hand side vectors b, because the factorization needs to be performed only once.

A direct method is completed in two steps: LU factorization

A = LU    (2)

followed by solution of the triangular systems

Ly = b    (3)

* Corresponding author.

0045-7825/97/$17.00 © 1997 Elsevier Science S.A. All rights reserved
PII S0045-7825(97)00052-2

and

Ux = y    (4)

where L is a lower triangular matrix with unit diagonal coefficients and U is an upper triangular matrix. The factorization of an N × N matrix A is completed in N steps. Setting A_1 = A, we have


A_i = ( a_ii   u_i^T )  =  ( 1          0       ) ( a_ii   u_i^T                  )
      ( v_i    B_i   )     ( v_i/a_ii   I_{N-i} ) ( 0      B_i - v_i u_i^T / a_ii ),    A_{i+1} = B_i - v_i u_i^T / a_ii,

until

A_N = [a_NN]    (5)

where v_i and u_i are two vectors of length N - i, and I_{N-i} is the (N - i) × (N - i) identity matrix. In the sequel, B_i' = B_i - v_i u_i^T / a_ii is known as the part of A remaining to be factored after the first i steps of the factorization have been performed. We also refer to performing the ith step of the factorization as eliminating variable x_i, or as eliminating the ith row and column of A.
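The elimination step above can be sketched in a few lines of dense linear algebra. The following is a minimal illustration (not the paper's code): it factors a small matrix by repeated rank-1 Schur-complement updates, exactly the update B_i - v_i u_i^T / a_ii described in the text, with no pivoting.

```python
import numpy as np

def lu_outer_product(A):
    """Dense LU by repeated rank-1 Schur-complement updates.

    At step i, row i of U and column i of L are fixed, and the trailing
    block is updated as B' = B - v * u^T / a_ii.  No pivoting, so every
    pivot a_ii must stay nonzero.
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)
    U = np.zeros((n, n))
    for i in range(n):
        U[i, i:] = A[i, i:]                 # u_i^T (including the pivot a_ii)
        L[i+1:, i] = A[i+1:, i] / A[i, i]   # v_i / a_ii
        # Schur complement: the part of A remaining to be factored
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:]) / A[i, i]
    return L, U

A = np.array([[4., 2., 1.],
              [2., 5., 3.],
              [1., 3., 6.]])
L, U = lu_outer_product(A)
assert np.allclose(L @ U, A)
```

The assertion verifies that the accumulated unit lower triangular L and upper triangular U reproduce A, as in (2).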

In order to improve numerical stability, as well as to avoid a_ii = 0, we permute certain rows and columns of the matrices so that a_ii has the maximum absolute value among all coefficients of each remaining matrix A_i. This procedure is known as pivoting, and (2) should then be written as

A = PLUQ    (6)

where P and Q are the matrices representing the accumulated row and column interchanges required by the pivoting.

Another important fact about sparse systems of linear equations is that, because the zero entries of the matrix B_i need not be zero entries of v_i u_i^T, some entries that are initially zero in A become nonzero after factorization; these entries are known as fill-in. Fill-in is inevitable in the factorization of sparse systems, but it can be drastically reduced by reordering the unknowns of the system. More precisely, let P be a permutation matrix; we can choose P such that the factorization of PAP^T has much less fill-in than that of A. Clearly, the permutation matrix P used here is different from the one used in (6). In general, the requirement of pivoting in the factorization process restricts the ability to use the ordering of unknowns to minimize the amount of fill-in. In some direct methods, including the frontal methods and supernodal methods, a limited pivoting can be applied without affecting the performance of the method (see [8,2]), and we will discuss this later.
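The effect of reordering on fill-in can be seen on a small example of our own (an arrowhead matrix, not an example from the paper): eliminating the densely coupled variable first fills the whole matrix, while the reversed ordering PAP^T, which eliminates it last, produces no fill-in at all.

```python
import numpy as np

def lu_fill(A):
    """No-pivot LU on a dense copy; returns the nonzero count of L + U."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for i in range(n):
        A[i+1:, i] /= A[i, i]
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
    return int(np.count_nonzero(np.abs(A) > 1e-12))

n = 6
# Arrowhead matrix: one variable coupled to all the others.
bad = np.eye(n) * 4.0
bad[0, :] = 1.0
bad[:, 0] = 1.0
bad[0, 0] = 4.0
good = bad[::-1, ::-1].copy()   # same system, dense variable ordered last

print(lu_fill(bad), lu_fill(good))   # 36 16: complete fill versus no fill
```

The original ordering turns all 36 entries nonzero, while the reversed ordering keeps exactly the 16 nonzeros the matrix started with.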

Without pivoting, once the ordering is determined, the precise locations of all fill-in entries in L and U can be predicted in advance. The process by which the nonzero structures of L and U are determined in advance is called symbolic factorization [7].

A heuristic method which has been found to be very effective in finding efficient orderings is the minimum degree algorithm [10]. At each step, this method selects the row with the least number of nonzero entries as the next row to be eliminated. The minimum degree algorithm is easy to implement and produces reasonably good orderings over a remarkably broad range of problem classes. However, because the minimum degree algorithm is a heuristic method lacking a theoretical foundation, its success is not well understood, and there is no robust and efficient way to deal with possible variability in the quality of the orderings it produces. Furthermore, the minimum degree algorithm is inherently a sequential process, and it would be difficult to develop a parallel direct sparse solver based on this method.
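For illustration, a greedy minimum degree ordering can be sketched directly on the adjacency structure of the matrix graph. This is only a toy version of our own; real implementations rely on quotient graphs and other refinements that it omits.

```python
def minimum_degree_order(adj):
    """Greedy minimum degree ordering on an undirected adjacency structure.

    adj: dict node -> set of neighbours.  At each step the node of smallest
    current degree is eliminated, and its neighbours are pairwise connected
    (the fill edges), mimicking symbolic Gaussian elimination.
    """
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    order = []
    while adj:
        v = min(adj, key=lambda u: (len(adj[u]), u))  # tie-break by label
        nbrs = adj.pop(v)
        for u in nbrs:
            adj[u].discard(v)
            adj[u] |= (nbrs - {u})   # clique the eliminated node's neighbours
        order.append(v)
    return order

# 5-node path graph 1-2-3-4-5: low-degree endpoints are eliminated first.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(minimum_degree_order(path))   # [1, 2, 3, 4, 5]
```

On a path graph no fill is created; on denser graphs the cliquing step in the loop is what models the fill-in produced by elimination.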

Another effective ordering algorithm is nested dissection [1]. The nested dissection algorithm is an ordering method based on a sequence of nested dissections of the domain. By dividing a given domain into two subdomains, we are able to reorder the unknowns of the system in such a way that

      ( A_11   0      A_13 )
A =   ( 0      A_22   A_23 )    (7)
      ( A_31   A_32   A_33 )

where A_11 and A_22 are the matrix blocks corresponding to the interior nodes of each subdomain, A_33 is the matrix block corresponding to the nodes on the interface between the two subdomains, and A_13, A_23, A_31 and A_32 are coupling matrix blocks between interior nodes and nodes on interface boundaries. The significance of the matrix partitioning given in (7) is that the zero blocks are preserved in the factorization, so that the fill-in is limited. This idea can be applied recursively, i.e. we break each subdomain into smaller and smaller pieces to limit fill-in in the diagonal matrix blocks. Furthermore, the successive dissection of each subdomain automatically places most unknowns associated with interior nodes before the unknowns associated with nodes on the interface boundaries, so that fill-in in the off-diagonal matrix blocks such as A_13, A_23, A_31 and A_32 is also limited. In [1], George proved that a nested dissection ordering for a uniform mesh on a square domain can reduce the arithmetic operation count from the usual O(N^2) to O(N^{3/2}) and the memory requirement from O(N^{3/2}) to O(N log_2 N). In addition, the two major matrix blocks A_11 and A_22 are completely decoupled and can be processed separately on different processors; thus the nested dissection algorithm is a natural parallel algorithm, which not only enables us to limit fill-in but also promotes concurrency at each level of the nested dissection.
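The recursive idea can be sketched as follows. This is our own simplified illustration on a rectangular grid graph (full grid lines as separators, hypothetical helper names), not the ordering code used in the paper.

```python
def nested_dissection(rows, cols):
    """Recursive nested dissection ordering of a rows x cols grid graph.

    Returns node indices (row-major, starting at 0) ordered so that the
    interiors of the two halves come first and the separator line last.
    Separators are full grid lines; recursion stops on tiny blocks.
    """
    def order(r0, r1, c0, c1):
        h, w = r1 - r0, c1 - c0
        if h <= 2 and w <= 2:                       # small block: stop recursing
            return [r * cols + c for r in range(r0, r1) for c in range(c0, c1)]
        if w >= h:                                  # split along a vertical line
            mid = c0 + w // 2
            left = order(r0, r1, c0, mid)
            right = order(r0, r1, mid + 1, c1)
            sep = [r * cols + mid for r in range(r0, r1)]
        else:                                       # split along a horizontal line
            mid = r0 + h // 2
            left = order(r0, mid, c0, c1)
            right = order(mid + 1, r1, c0, c1)
            sep = [mid * cols + c for c in range(c0, c1)]
        return left + right + sep                   # interiors first, separator last
    return order(0, rows, 0, cols)

# 3 x 3 grid of nodes: the two outer columns come first, the middle
# separator column {1, 4, 7} is ordered last.
print(nested_dissection(3, 3))
```

Note the decoupling: the `left` and `right` recursions never share a node, so in a parallel code they could be processed on different processors, as the text observes for A_11 and A_22.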

Despite its theoretical superiority, the nested dissection algorithm has seen little application in practical computation. A major difficulty which the nested dissection method runs into in its implementation is that it requires the original domain to be divided into several hundreds to thousands of subdomains, and a special data structure or housekeeping scheme is then required to impose an order that eliminates interior nodes first and interface boundary nodes next. Furthermore, such a data structure or housekeeping scheme must be easy to implement and must possess a simple interface with existing finite element codes.

In this paper, we will demonstrate that the above difficulties can be overcome by a modified frontal method or, more precisely, a multifrontal method. The paper is organized as follows. We first give a simple description of the frontal method and of the method used to control the factorization in practice; next, we discuss certain details of the implementation on conventional sequential machines as well as on parallel machines; finally, the experimental results are presented.

2. The overall strategy

The frontal method was originally proposed by Irons [9] in 1970, and later Duff [3] introduced the multifrontal method to solve problems with indefinite linear systems. In this study, we extend their work and develop a method which enables us to eliminate unknowns in a nested dissection ordering.

To demonstrate the frontal method, the four-element and nine-node mesh shown in Fig. 1 will be considered. We assume that there is only one unknown associated with each node and that all the equations are stored in order of ascending node number.

The stiffness matrix and the right-hand side vector arising in finite element methods can be naturally expressed in the form

A = Σ_{m=1}^{M} A^m    (8)

and

b = Σ_{m=1}^{M} b^m    (9)

Fig. 1. The mesh of 4 finite elements and the node numbers.

where A^m and b^m represent the contribution from a single finite element and M is the total number of elements. Because most entries in A^m and b^m are zero, we can pack each A^m into a small full matrix (known as the element stiffness matrix) and each b^m into a small full vector (known as the element load vector); a vector of indices is then created to label where each entry of the packed matrix and vector fits into the global matrix A and the vector b. The process of adding the element stiffness matrices and load vectors into the global matrix and vector is known as assembly.

The index vectors created in the frontal method are referred to as the element destination vectors. Instead of adding the local stiffness matrices and load vectors into a global system, the frontal method adds the local matrices and vectors into a small dense linear system which we refer to as a front. The size of the element destination vector is equal to the number of nodes on the element. The combination of all element destination vectors is the destination vector. Initially, the components of each element destination vector are set to the nicknames of the nodes on the element, i.e.

the nickname = the node number × M_c + n_c    (10)

where M_c is an integer which must be larger than the maximum number of unknowns associated with any node, and n_c is the number of unknowns associated with the corresponding node.

The initial form of the destination vector is also called the nickname vector. Each nickname component contains two pieces of information: the node number and the number of unknowns associated with the node. The nickname vector thus provides the information about the node connectivity on each element and the number of unknowns associated with each node.

For the example shown in Fig. 1, we have assumed that n_c = 1 and will use M_c = 10; thus the nickname vectors should be

(11, 41, 51, 21,    for the element I
21, 51, 61, 31,    for the element II
41, 71, 81, 51,    for the element III    (11)
51, 81, 91, 61)    for the element IV.

The major function of the symbolic factorization is to convert each component of the destination vector from (10) into the form

the location index × 10 × M_c + n_c × 10 + flag.    (12)

Here, the location index gives the location where each entry of the element matrix and load vector fits into the front, and

flag = { 0:  not the last occurrence of the node
       { 1:  the last occurrence of the node    (13)

gives the information on whether the corresponding entries of the frontal matrix and vector have been fully assembled or not. Clearly, "the last occurrence of the node", or flag = 1, means yes, and otherwise no. For example, after the symbolic factorization, the vectors shown in (11) should be converted into

(111, 210, 310, 410,    for the element I
311, 210, 410, 511,    for the element II
111, 411, 510, 210,    for the element III    (14)
111, 311, 411, 211)    for the element IV.
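The encodings (10) and (12) are simple enough to state as code. The helpers below are a hypothetical sketch using M_c = 10 as in the example; the paper's implementation is not shown, so the names are ours.

```python
M_C = 10   # must exceed the maximum number of unknowns per node (10 here, as in the text)

def nickname(node, n_unknowns=1):
    """Encoding (10): nickname = node number * M_c + n_c."""
    return node * M_C + n_unknowns

def destination(location, n_unknowns=1, last=False):
    """Encoding (12): location index * 10 * M_c + n_c * 10 + flag."""
    return location * 10 * M_C + n_unknowns * 10 + (1 if last else 0)

def decode_destination(d):
    """Split a destination entry back into (location index, n_c, flag)."""
    return d // (10 * M_C), (d // 10) % M_C, d % 10

# Element I of the example: nodes 1, 4, 5, 2, one unknown each.
print([nickname(n) for n in (1, 4, 5, 2)])        # [11, 41, 51, 21]
# Node 1 occurs for the last time here (front position 1): destination 111.
print(destination(1, last=True))                  # 111
print(decode_destination(111))                    # (1, 1, 1)
```

Packing the position, the unknown count and the flag into one integer is what lets the symbolic factorization hand the numerical phase a single flat vector per element.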

The frontal method operates on an element-by-element basis. For the example considered here, the first front should be

( a_11^I   a_14^I   a_15^I   a_12^I ) ( x_1 )   ( b_1^I )
( a_41^I   a_44^I   a_45^I   a_42^I ) ( x_4 ) = ( b_4^I )    (15)
( a_51^I   a_54^I   a_55^I   a_52^I ) ( x_5 )   ( b_5^I )
( a_21^I   a_24^I   a_25^I   a_22^I ) ( x_2 )   ( b_2^I )

where the superscript denotes the element number from which the matrix and right-hand side vector entries were derived, and the subscripts are the global positions of the corresponding coefficients. The flags in the corresponding element destination vector (the first part of the vector in (14)) tell us that the first row and column (corresponding to node 1) have been fully assembled and can be eliminated from (15). After elimination, the state of the front becomes

( a'_44   a'_45   a'_42 ) ( x_4 )   ( b'_4 )
( a'_54   a'_55   a'_52 ) ( x_5 ) = ( b'_5 )    (16)
( a'_24   a'_25   a'_22 ) ( x_2 )   ( b'_2 )

where

a'_ij = a_ij^I - a_i1^I a_1j^I / a_11^I

and

b'_i = b_i^I - a_i1^I b_1^I / a_11^I.

Next, we add the second element to (16) to form a new front. The location indices in the second element destination vector label where each entry of the local stiffness matrix and load vector fits into the new front. After assembly, the front should be

( a'_44   a'_45             a'_42             0         0       ) ( x_4 )   ( b'_4           )
( a'_54   a'_55 + a_55^II   a'_52 + a_52^II   a_56^II   a_53^II ) ( x_5 )   ( b'_5 + b_5^II  )
( a'_24   a'_25 + a_25^II   a'_22 + a_22^II   a_26^II   a_23^II ) ( x_2 ) = ( b'_2 + b_2^II  )    (17)
( 0       a_65^II           a_62^II           a_66^II   a_63^II ) ( x_6 )   ( b_6^II         )
( 0       a_35^II           a_32^II           a_36^II   a_33^II ) ( x_3 )   ( b_3^II         )

Similarly, from the flags in the destination vector, we know that the second and third rows and columns can be eliminated from (17), and after elimination the front is ready for the next element. The LU factorization and forward substitution are complete after the same procedure has been repeated for all elements. The backward substitution is performed in reverse order (starting at the last element), and the procedure is similar.
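The element-by-element procedure can be condensed into a small sketch. The following toy single-front solver is our own simplification (one unknown per node, no pivoting, dictionary-based storage instead of the packed arrays a real frontal code would use): it assembles each element into the front, eliminates every variable whose last occurrence has been reached, and back-substitutes in reverse elimination order.

```python
import numpy as np

def frontal_solve(elems, loads):
    """Minimal single-front solver sketch: one unknown per node, no pivoting.

    elems: list of (nodes, Ke) element matrices; loads: matching local vectors.
    A variable is eliminated as soon as the element containing its last
    occurrence has been assembled (the role of the flags in the text).
    """
    last = {}
    for m, (nodes, _) in enumerate(elems):
        for n in nodes:
            last[n] = m                       # symbolic pass: find last occurrences

    front, rhs, active, eliminated = {}, {}, set(), []
    for m, (nodes, Ke) in enumerate(elems):
        active |= set(nodes)
        for i, ni in enumerate(nodes):        # assemble the element into the front
            rhs[ni] = rhs.get(ni, 0.0) + loads[m][i]
            for j, nj in enumerate(nodes):
                front[ni, nj] = front.get((ni, nj), 0.0) + Ke[i][j]
        for n in [v for v in nodes if last[v] == m]:   # fully assembled: eliminate
            row = {c: front.pop((n, c)) for c in list(active) if (n, c) in front}
            b_n = rhs.pop(n)
            active.discard(n)
            for r in active:                  # rank-1 Schur update of the rest
                if (r, n) in front:
                    mult = front.pop((r, n)) / row[n]
                    for c, v in row.items():
                        if c != n:
                            front[r, c] = front.get((r, c), 0.0) - mult * v
                    rhs[r] = rhs.get(r, 0.0) - mult * b_n
            eliminated.append((n, row, b_n))  # forward substitution already folded in

    x = {}
    for n, row, b_n in reversed(eliminated):  # backward substitution
        x[n] = (b_n - sum(v * x[c] for c, v in row.items() if c != n)) / row[n]
    return x

# Three 1-D elements on nodes 0-1, 1-2, 2-3 (a hypothetical SPD element matrix).
Ke = [[2.0, -1.0], [-1.0, 2.0]]
elems = [((0, 1), Ke), ((1, 2), Ke), ((2, 3), Ke)]
loads = [[1.0, 1.0]] * 3
x = frontal_solve(elems, loads)

# Check against the assembled global system.
A = np.zeros((4, 4)); b = np.zeros(4)
for (nodes, K), f in zip(elems, loads):
    for i, ni in enumerate(nodes):
        b[ni] += f[i]
        for j, nj in enumerate(nodes):
            A[ni, nj] += K[i][j]
assert np.allclose([x[i] for i in range(4)], np.linalg.solve(A, b))
```

The final assertion confirms that frontal elimination is just Gaussian elimination on the assembled system, performed in the order in which variables become fully assembled.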

The order in which unknowns are eliminated in the frontal method is determined by the sequence in which the A^m and b^m are added into the fronts; for the example considered here, the unknowns are eliminated in the order (by node number)

(1, 2, 3, 4, 7, 5, 8, 9, 6)

which is not a nested dissection ordering. Next, we discuss a modified frontal method which can eliminate unknowns in a nested dissection ordering, and the same example will be used to demonstrate the solution procedure.

The new method requires a series of nested dissections of the original domain and a set of extra auxiliary elements to control the sequence of elimination on each subdomain. As shown in Fig. 2, the first level of the dissection divides the original domain given in Fig. 1 into two subdomains, which are separated by the boundary consisting of the nodes

(4, 5, 6).    (18)

The second level of the dissection further divides the domain into four subdomains, and they are separated by the boundaries consisting of the nodes

(4, 5, 2), (2, 5, 6), (4, 8, 5) and (5, 8, 6),    (19)

respectively. The factorization proceeds from the higher level to the lower level, and at each level the frontal solver works independently on each subdomain and is restricted to eliminate only the unknowns associated with interior nodes. In order to do this, we create the following sets of nickname vectors:

the second-level dissection:

(11, 41, 51, 21,  41, 51, 21)
(21, 51, 61, 31,  21, 51, 61)
(41, 71, 81, 51,  41, 81, 51)    (20)
(51, 81, 91, 61,  51, 81, 61)

the first-level dissection:

(41, 51, 21,  21, 51, 61,  41, 51, 61)
(41, 81, 51,  51, 81, 61,  41, 51, 61)    (21)

and finally, the zero-level dissection:

(41, 51, 61)    (22)

For convenience, here we consider the original domain as the subdomain of the zero-level dissection.

Each nickname vector in (20) is composed of the element nickname vectors for the original elements on the subdomain plus a so-called control element. The control elements consist of the nodes on the separation boundaries (one of the node vectors given in (19)). Then, the control elements of the second-level dissection become the new elements of the first-level dissection. Each nickname vector at the first level is composed of two parts: the nickname vectors of the control elements from the second-level dissection and the nickname vector of its own control element, which also consists of the nodes on the separation boundary (the node vector given in

Fig. 2. The two levels of the nested domain decomposition on the mesh given in Fig. 1. After the second domain decomposition, the multifrontal method proposed in this study can eliminate the unknowns in a nested dissection ordering.

(18)). Similarly, the control element of the first-level dissection becomes the new element of the zero-level dissection, and no control element is needed for the zero-level dissection.

Because of the control elements, we can process each nickname vector independently during the symbolic factorization and still have the flags in the destination vectors set correctly. After the symbolic factorization, the vectors given in (20), (21) and (22) are converted into

the second-level dissection:

(111, 210, 310, 410,  111, 211, 311)
(110, 210, 310, 411,  111, 211, 311)
(110, 211, 310, 410,  111, 211, 311)
(110, 210, 311, 410,  111, 211, 311)

the first-level dissection:

(110, 210, 310,  311, 210, 410,  111, 211, 311)
(110, 210, 310,  310, 211, 410,  111, 211, 311)

and finally, the zero-level dissection:

(111, 211, 311).

The last element vector (the control element) in each destination vector is skipped after the symbolic factorization. The real computation is performed in seven steps, and each step is controlled by the destination vectors

(111, 210, 310, 410)                 step 1
(110, 210, 310, 411)                 step 2
(110, 211, 310, 410)                 step 3
(110, 210, 311, 410)                 step 4
(110, 210, 310,  311, 210, 410)      step 5    (23)
(110, 210, 310,  310, 211, 410)      step 6
(111, 211, 311)                      step 7

The computation on the different subdomains can proceed independently, and the flags in the destination vectors only allow the elimination of the unknowns associated with interior nodes. During steps 1-4, the unknowns associated with the nodes 1, 3, 7 and 9 are eliminated. Each of these steps opens a new working front, and there are four unfinished working fronts at the end of step 4. Steps 5 and 6 combine the unfinished fronts from the previous steps and eliminate the unknowns associated with nodes 2 and 8; steps 5 and 6 also open two new fronts. Those two unfinished fronts are then added together at the final step, and the unknowns associated with nodes 4, 5 and 6 are eliminated. The unknowns here are eliminated in the order (by node number)

(1, 3, 7, 9, 2, 8, 4, 5, 6)

which is a nested dissection ordering.

Because there exists more than one working front at certain stages of the computation, the method discussed is a multifrontal method. A special issue arising in this method is that extra storage must be allocated to save the unfinished fronts during the computation. Consider the cubic mesh shown in Fig. 5. To reduce the maximum number of unfinished working fronts, instead of the procedure discussed above (shown in Fig. 6(a)), the factorization is actually performed in the sequence shown in Fig. 6(b), which reduces the maximum number of unfinished fronts from P to log_2 P.

Most importantly, in the sequence given in Fig. 6(b), the storage of the working fronts can be handled simply as a stack. To clarify this, consider Fig. 6(b). After the first step, we push the front onto the stack and then start step 2. At the end of step 2, we pop the front saved in the first step off the stack, add it to the current front, and do the factorization of step 3. At the end of step 3 the current front is pushed onto the stack, and likewise at the end of step 4. At this moment there are two fronts in the stack, and under the last in and first out stack rule, the front saved at the end of step 4 will be the first one to be popped; it is exactly what we want at the end of step 5. The front saved at step 3 is the next one to be popped, and it is also exactly what we want at the end of step 6. We then combine the fronts from steps 3 and 6 and complete the work. The memory required by the stack is dynamically allocated and reaches its maximum at the end of step

Σ_{i=0}^{l-1} 2^i + 1

where P = 2^l; the entire memory is then released before the end of the factorization. The same stack algorithm is also used to save and update the right-hand sides of the fronts during the forward and backward substitution. The major advantage of the stack algorithm proposed here is that it avoids the complexity of saving a list of addresses pointing to the memory locations of fronts, which greatly simplifies the implementation.
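The depth-first sequence of Fig. 6(b) and its stack behaviour can be sketched symbolically. The function below is our own illustration: it walks a balanced elimination tree over P = 2^l leaf fronts in the last-in-first-out order described above and records the peak number of unfinished fronts held on the stack.

```python
import math

def multifrontal_schedule(P):
    """Symbolic walk of a balanced elimination tree over P = 2^l leaf fronts.

    Follows the Fig. 6(b) order: depth-first, so partially finished fronts
    live on a LIFO stack.  Returns the step labels and the peak stack depth.
    """
    levels = int(math.log2(P))
    stack, steps, peak = [], [], 0

    def visit(level, name):
        nonlocal peak
        if level > 0:
            visit(level - 1, name + "L")       # factor the left subtree first
            stack.append("front:" + name + "L")
            peak = max(peak, len(stack))       # its front waits on the stack
            visit(level - 1, name + "R")       # then the right subtree
            stack.pop()                        # merge the saved front back in
        steps.append(name if name else "root")

    visit(levels, "")
    return steps, peak

steps, peak = multifrontal_schedule(8)
print(len(steps), peak)   # 15 3: 2P - 1 steps, peak stack depth log2(P)
```

For P = 4 this reproduces the seven-step example in the text, with two fronts on the stack at the end of step 4; in general the peak is log_2 P rather than the P unfinished fronts of the Fig. 6(a) order.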

3. Performance

In this section, we discuss the performance of the proposed method and demonstrate that, after a proper sequence of nested domain decompositions, the frontal width is greatly reduced, and so are the storage and operation counts.

The stiffness matrices arising in thc tinite clcment or tinite differencc mcthods arc always structurallysymmetric. Matrix A = rllij INXN is said to be structurally symmetric if (/!i # 0 follows that lip # 0 for I ::;;i,j ::;;N. For a structurally symmetric matrix. the number and position of nonzero entries in the vectors vk and /Ik

in (2) are identical and the number of arithmetic operations required 10 factor a structurally symmetric matrix Acan be calculated by

N- ) N-I

()= L (2µ~+ µk) "" 2 L µ~k~ J A~ I

and the total number of nonzero entries in Land U matrices will beN N

T/= L (2J1-k -1)""2 L µA'k~1 k-I

(24)

(25)

where µ_k is the number of nonzero entries in v_k or u_k. In the frontal method, we also call µ_k the frontal width.

In [1], George proved that for the problem with a uniform mesh of linear elements on a square domain, the nested dissection method can reduce the memory requirement from O(N^{3/2}) to O(N log_2 N) and the number of arithmetic operations from O(N^2) to O(N^{3/2}). George's analysis requires that the original domain be dissected iteratively until each new subdomain contains at most 2 × 2 = 4 elements. In this study, we consider the performance of the method on a more realistic basis, requiring only

P ≫ 1 and M/P ≫ 1    (26)

where P is the total number of subdomains after the final level of the dissection and M is the total number of elements.

For the problem given in Fig. 3, the frontal width (i.e. µ_k) of the conventional single-front method is on average about √N, and by using (24) and (25) we have

θ = 2 Σ_{k=1}^{N-1} µ_k^2 = 2 Σ_{k=1}^{N-1} (√N)^2 = 2N^2    (27)

and

η = 2 Σ_{k=1}^{N} µ_k = 2 Σ_{k=1}^{N} √N = 2N^{3/2}.    (28)
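As a quick check of (27) and (28), the single-front estimates for the 16641-unknown two-dimensional test problem used later in Table 1 can be evaluated directly; they land close to the tabulated P = 1 values.

```python
N = 16641                      # degrees of freedom of the 2-D test problem
theta = 2 * N**2               # operation estimate (27)
eta = 2 * N**1.5               # nonzero-entry estimate (28)
print(f"theta = {theta:.3e}, eta = {eta:.3e}")
# theta = 5.538e+08, eta = 4.293e+06 -- compare the P = 1 row of Table 1
```

The estimates agree with the measured 5.646 × 10^8 operations and 4.29 × 10^6 nonzeros to within a few per cent.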

For the multifrontal method, we suppose that after l levels of nested dissection the original domain is divided into P = 2^l square subdomains with n_s = √(N/P) + 1 nodes on each side. Because the elimination of the

Fig. 3. The two-dimensional example: the uniform mesh of linear elements on a square domain. Here, the front moves from the right side to the left. The frontal width (the number of nodes covered by the shadow) is on average equal to √N.

Fig. 4. The frontal width on a subdomain. We suppose that the original square domain is divided into P square subdomains. On average, there are n_s = √(N/P) + 1 nodes on each side of the subdomains, and the frontal width µ_k for an interior node on the element (i, j) is 2(n_s + 1) + 2i + 1 (the number of nodes covered by the shadow).

unknowns associated with the boundary nodes must be held until all the unknowns on the interior nodes have been eliminated, the frontal width associated with a given interior node (as shown in Fig. 4) is equal to

µ_k = 2(n_s + 1) + 2⌊k/n_s⌋ + 1 ≈ 2(n_s + ⌊k/n_s⌋)    (29)

where ⌊r⌋ represents the largest integer not exceeding r. It is very difficult to derive a general formula for the frontal width associated with the nodes on the boundaries or interface boundaries. However, under assumption (26), we can simply use (29) to evaluate the frontal width associated with the boundary nodes as well, and then the total number of arithmetic operations and the storage required to factor A can be estimated by

θ̂ = 2P Σ_{k=1}^{N/P} µ_k^2 = 8P Σ_{k=1}^{N/P} (n_s + ⌊k/n_s⌋)^2 ≈ (56/3) N^2/P    (30)

and

η̂ = 2P Σ_{k=1}^{N/P} µ_k = 4P Σ_{k=1}^{N/P} (n_s + ⌊k/n_s⌋) ≈ 6 N^{3/2}/√P,    (31)

respectively. The experimental results indicate that the estimated values θ̂ and η̂ are close to the real θ and η when assumption (26) holds.

Comparing (30) and (31) with (27) and (28), we can conclude that for the problem shown in Fig. 3, the method proposed here reduces the operation count and the number of nonzero entries in the L and U matrices to

28/(3P) and 3/√P    (32)

of their original values, respectively. In addition, a similar analysis indicates that for the three-dimensional example shown in Fig. 5, the multifrontal method can reduce the operation count and the number of nonzero entries in the L and U matrices from 2N^{7/3} and 2N^{5/3} to

28/(3P^{4/3}) and 4/P^{2/3}    (33)

Fig. 5. The three-dimensional example: the uniform mesh of linear elements on a cubic domain.

Fig. 6. The maximum number of working fronts required by the factorization sequence (a) is equal to P (P = 4 for the example given here), and it occurs after the first P steps have been performed. The sequence given in (b) reduces the maximum number of working fronts to log_2 P and, most importantly, the storage of the working fronts can be handled simply as a stack.

of their original values, respectively.

Finally, the symbolic factorization of the conventional frontal method is an O(N^2) operation, where N is the length of the destination vector. However, the method proposed in this study breaks the original destination vector into P independent vectors and reduces the symbolic factorization to an operation of

O(P × (N/P)^2) = O(N^2/P),

which drastically reduces the cost of the symbolic factorization.

4. Experiments

The multifrontal method proposed in this study has been implemented on sequential machines as well as on distributed-memory machines. The performance of the method was tested through a series of experiments on two- and three-dimensional examples, and we give here a summary of those results.

4.1. Performance

Since the frontal width for the equations is determined after the symbolic factorization, the storage and the operation counts can be calculated exactly by using (24) and (25). We calculated the nonzero entries of the stiffness matrix after the factorization and the operation counts of the LU factorization for the two- and three-dimensional examples shown in Figs. 3 and 5 with different numbers of linear or quadratic elements. Table 1 shows the result of a typical computation performed on the two-dimensional test problem with a mesh of linear elements, and Table 2 shows the result for the three-dimensional test problem. Here we are able to reduce the storage by 70% and the operation counts of the LU factorization by nearly 90% by using proper numbers of fronts. The calculated results for the mesh of quadratic elements are similar and are not presented.

For the real computation, we solve a two-dimensional Laplace problem on the two-dimensional test problem with different mesh sizes. Table 3 compares the computational time of using different numbers of fronts for

Fig. 7. The arrangements of the fronts for the tests performed on the phase II work. Because a perfect load balance can be expected in this experiment, we have parallel efficiency near 100%.

Table 1
The comparison of storage and operational counts in LU factorization for different numbers of fronts (the two-dimensional uniform mesh of 128 x 128 = 16384 linear elements and 16641 degrees of freedom), where P represents the number of fronts

No. of fronts (P)        P = 1           P = 16                  P = 64                  P = 256
No. of nonzero entries   4.29 x 10^6     2.82 x 10^6 (65.7%)     1.79 x 10^6 (41.7%)     1.28 x 10^6 (29.9%)
Operational counts       5.646 x 10^8    2.666 x 10^8 (47.2%)    1.284 x 10^8 (22.7%)    0.853 x 10^8 (15.1%)

Table 2
The comparison of storage and operational counts in LU factorization for different numbers of fronts (the three-dimensional uniform mesh of 32 x 32 x 32 = 32768 linear elements and 35937 degrees of freedom), where P represents the number of fronts

No. of fronts (P)        P = 1           P = 64                  P = 256                 P = 512
No. of nonzero entries   77.2 x 10^6     27.9 x 10^6 (36.1%)     23.0 x 10^6 (29.7%)     22.3 x 10^6 (28.8%)
Operational counts       85.2 x 10^9     17.9 x 10^9 (21.0%)     16.2 x 10^9 (19.0%)     16.0 x 10^9 (18.8%)

Table 3
The comparison of the time spent on each step of the computation among different numbers of fronts. We here solve a two-dimensional Laplace problem on the uniform mesh of 64 x 64 = 4096 quadratic elements and 16641 degrees of freedom, and the test was performed on an IBM RS/6000 3BT machine. Here, P also represents the number of fronts

P                                            1        16            64             128            256
Symbolic factorization (s)                   87.2     7.0 (8.0%)    3.5 (4.0%)     3.4 (4.0%)     4.4 (5.1%)
Factorization and forward substitution (s)   118.7    88.7          25.0 (21.1%)   23.1 (19.5%)   24.0 (20.2%)
Backward substitution (s)                    89.6     19.6          5.7 (6.4%)     5.7 (6.4%)     7.6 (8.5%)


a test performed on a mesh of 64 x 64 = 4096 quadratic elements (16641 degrees of freedom). As shown in Table 3, the best results are obtained when 64 fronts are used for this size of problem: by using 64 fronts, we can reduce the computational time of the symbolic factorization and the backward substitution by 95% and the time of the factorization and forward substitution by nearly 80%.

4.2. The parallel implementation

The computation on each front is completely independent until it touches other fronts, and this inherent parallelism can be exploited to develop a successful parallel sparse solver. Furthermore, because the frontal method conducts all its computation on a series of dense linear systems, its parallel computation can be treated as a special application of a parallel solver for dense linear systems; the work on the parallel implementation is therefore simplified through the use of a parallel dense solver package which we developed in previous research [6].
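To make concrete what it means that the entire computation is performed on a series of small dense linear systems, the following sketch (our illustration with a hypothetical interface, not code from the paper) performs the work done on one front: the fully summed variables of the dense frontal matrix are eliminated by LU factorization without pivoting, and the remaining Schur complement is the contribution assembled into the parent front. It assumes the fully summed block is well conditioned, since the classic frontal method offers no pivoting (cf. Section 5).

```python
import numpy as np

def eliminate_front(F, n_elim):
    """Partially factorize a dense frontal matrix F: eliminate its first
    n_elim (fully summed) variables by LU without pivoting, and return the
    LU factors of that block plus the Schur complement for the parent front.
    Hypothetical interface for illustration only."""
    F = np.asarray(F, dtype=float)
    A = F[:n_elim, :n_elim].copy()   # fully summed block
    B = F[:n_elim, n_elim:]          # coupling to not-yet-summed variables
    C = F[n_elim:, :n_elim]
    D = F[n_elim:, n_elim:]
    # Right-looking dense LU of the fully summed block: A = L U.
    L = np.eye(n_elim)
    U = A
    for k in range(n_elim):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])
    # Schur complement S = D - C A^{-1} B, summed into the parent front.
    Y = np.linalg.solve(L, B)            # L Y = B
    X = np.linalg.solve(U.T, C.T).T      # X U = C
    S = D - X @ Y
    return L, U, S
```

In the multifrontal setting, S is not factorized here; it is added entrywise into the frontal matrix of the parent front, where those variables eventually become fully summed.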

Most research on parallel multifrontal methods (see e.g. [4,5,11]) merely uses the multifrontal method as a means of parallel computation and considers only the case P = P1, where P is the number of fronts and P1 represents the number of processors. In this study, we not only use the multifrontal method for parallel computation but also use it as a means to minimize the fill-in, so our implementation must be established on a more general basis with P ≥ P1 and P1 ≥ 1. The entire implementation work was planned in three phases:

(1) the sequential multifrontal method (P ≥ 1 and P1 = 1),
(2) the multi-processor case with a single front on each processor (P = P1 and P1 ≥ 1), and
(3) the general parallel multifrontal method (P ≥ P1 and P1 ≥ 1).
Phase II implementation was completed on an Intel iPSC/860 machine and Fig. 7 shows some experimental
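The general case P ≥ P1 can be pictured as a bottom-up schedule on a binary tree of fronts: near the leaves each processor eliminates its own fronts independently, while towards the root fewer, larger fronts are each handled cooperatively by a growing group of processors using parallel dense kernels. A minimal sketch of such a schedule, assuming for simplicity that P and P1 are powers of two (the function and its layout are our illustration, not the paper's implementation):

```python
def merge_schedule(P, P1):
    """Return a list of tree levels for P leaf fronts on P1 processors
    (P >= P1 >= 1, both powers of two). levels[l][f] is the sorted list of
    processors that cooperate on front f at level l; level 0 holds the
    leaves, the last level the root front. Illustrative sketch only."""
    assert P >= P1 >= 1 and P % P1 == 0
    assert P & (P - 1) == 0 and P1 & (P1 - 1) == 0
    # Leaves: contiguous blocks of P // P1 fronts per processor,
    # so each processor's subtree work is fully independent.
    levels = [[[f * P1 // P] for f in range(P)]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Merge fronts pairwise; the processor groups of the children
        # join to eliminate the parent front together.
        levels.append([sorted(set(prev[i] + prev[i + 1]))
                       for i in range(0, len(prev), 2)])
    return levels
```

With P = P1 this reduces to the phase II situation (one front per processor at the leaves); with P1 = 1 it degenerates to the sequential phase I method, where one processor simply works through all P fronts.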

results. In this experiment, we compared the computational time spent on solving the same size of problem using different numbers of processors. As shown in Fig. 7, uniform meshes of 32 x 16 quadratic elements (the total number of degrees of freedom is 2275) and 32 x 32 quadratic elements (the total number of degrees of freedom is 4225) are used. For the mesh of 32 x 16 quadratic elements, a single processor took about 76 s to solve a Laplace problem, while only 40 s are needed to solve the same problem with two processors. It takes about 191 s for a single processor to solve a Laplace problem on the mesh of 32 x 32 quadratic elements, and about 53 s for the same problem solved using four processors, i.e. parallel efficiencies of about 95% and 90%, respectively. The experimental results simply indicate that the multifrontal method is able to give good parallel efficiency when there is a good load balance among the processors.

A more extensive experiment on the parallel computation will be performed after the phase III implementation is completed. This part of the work has been moved to the IBM SP2 machine and the new implementation will be based on MPI (Message Passing Interface). After completing this part of the work, we will be able to provide a portable parallel multifrontal solver for solving problems on any parallel machine.

5. Future work and conclusion

The frontal method was originally designed for solving problems with positive definite linear systems, and there is no choice of pivoting in the method. Although numerous problems with indefinite linear systems have been successfully solved by the frontal method, there is no theoretical justification for doing this. Considering the reality of engineering computation, it is necessary to provide some pivoting capability in any direct solver if one expects it to be used to solve problems with indefinite linear systems. In [8], Hood proposed that, because the frontal method does everything in a series of linear systems with dense stiffness matrices, one can attain limited pivoting choices with a small modification of the original frontal method. Hood's algorithm is easy to implement and largely retains the original performance. In [3], Duff further proposed that one can combine a few different fronts together to obtain larger fronts and thereby achieve better choices of pivots. Duff's method will inevitably sacrifice some efficiency but can provide better numerical stability for problems with indefinite linear systems.
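As a rough illustration of the kind of restricted pivoting Hood's modification allows, the sketch below (hypothetical code, not the algorithm of [8] verbatim) performs dense LU on one frontal matrix while confining the pivot search to the fully summed rows: variables that are not yet fully summed cannot serve as pivots and remain in the Schur complement passed to the parent front, which is exactly why only limited pivoting choices are available within one front.

```python
import numpy as np

def eliminate_front_pivoted(F, n_full):
    """Eliminate the first n_full (fully summed) variables of a dense
    frontal matrix with row interchanges restricted, in Hood's spirit,
    to the fully summed rows. Returns the packed factors and the row
    permutation. Illustrative sketch with a hypothetical interface."""
    F = np.asarray(F, dtype=float).copy()
    perm = list(range(F.shape[0]))
    for k in range(n_full):
        # Restricted partial pivoting: search column k only among the
        # fully summed rows k .. n_full - 1; rows of not-yet-summed
        # variables are off limits because their entries are incomplete.
        p = k + int(np.argmax(np.abs(F[k:n_full, k])))
        if p != k:
            F[[k, p], :] = F[[p, k], :]
            perm[k], perm[p] = perm[p], perm[k]
        F[k + 1:, k] /= F[k, k]                               # multipliers
        F[k + 1:, k + 1:] -= np.outer(F[k + 1:, k], F[k, k + 1:])
    # F now holds the multipliers (strictly below the diagonal of the first
    # n_full columns), U (on and above the diagonal of the first n_full
    # rows), and the Schur complement F[n_full:, n_full:] for the parent.
    return F, perm
```

Duff's variant in [3] enlarges n_full by merging fronts, widening the pool of admissible pivots at the cost of larger dense systems.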

The major unfinished work in this research is the implementation of pivoting in the solver. The final version of our solver should be able to apply Hood's pivoting method automatically during the computation and provide the option of doing pivoting in the way proposed in [3].


The objective of this research is to develop an efficient sparse solver for practical applications; the solver should be able to work on conventional sequential machines as well as on parallel machines. The generality, robustness and efficiency of the frontal method have been proved by many years of industrial and research applications. Our task here is to develop a practical approach which can combine the classic frontal method with the latest developments, including parallel computation, ordering and pivoting.

Acknowledgement

The support of the Office of Naval Research under Grant N00014-95-1041 is gratefully acknowledged.

References

[1] A. George, Nested dissection of a regular finite element mesh, SIAM J. Numer. Anal. 10 (1973) 345-363.
[2] J.W. Demmel, S.C. Eisenstat, J.R. Gilbert, X.S. Li and J.W.H. Liu, A supernodal approach to sparse partial pivoting, 1995. Private communication.
[3] I.S. Duff and J.K. Reid, The multifrontal solution of indefinite sparse symmetric linear equations, ACM Trans. Math. Software 9 (1983) 302-325.
[4] I.S. Duff, Parallel implementation of multifrontal schemes, Parallel Comput. 3 (1986) 193-204.
[5] G.A. Geist, Solving finite element problems with parallel multifrontal schemes, in: M.T. Heath, ed., Hypercube Multiprocessors (SIAM, Philadelphia, PA, 1987) 656-661.
[6] P. Geng, J.T. Oden and R.A. van de Geijn, Massively parallel computation for acoustical scattering problems using boundary element methods, J. Sound Vib. 191 (1) (1996) 145-165.
[7] M.T. Heath, E. Ng and B.W. Peyton, Parallel algorithms for sparse linear systems, SIAM Rev. 33 (1991) 420-460.
[8] P. Hood, Frontal solution program for unsymmetric matrices, Int. J. Numer. Methods Engrg. 10 (1976) 379-399.
[9] B. Irons, A frontal solution program for finite element analysis, Int. J. Numer. Methods Engrg. 2 (1970) 5-32.
[10] D.J. Rose, A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations, in: R.C. Read, ed., Graph Theory and Computing (Academic Press, New York, 1972).
[11] W.P. Zhang and E.M. Lui, A parallel frontal solver on the Alliant FX/80, Comput. Struct. 38 (1991) 203-215.