
(c) 2001 American Institute of Aeronautics & Astronautics or Published with Permission of Author(s) and/or Author(s)' Sponsoring Organization.

AIAA 2001-2588
15th AIAA Computational Fluid Dynamics Conference
11-14 June 2001, Anaheim, CA

PARALLEL IMPLICIT EULER SOLVER ON HOMOGENEOUS AND HETEROGENEOUS COMPUTING ENVIRONMENTS*

PeiZong Lee]" Chih-Hao Chang* and Jan-Jan Wu§

Institute of Information Science, Academia Sinica

Taipei, Taiwan, R.O.C.

In this paper, we present a domain decomposition approach for designing a parallel implicit Euler solver on workstation clusters. First, the computing domain is tessellated by an unstructured mesh based on a background quadtree. Using the Hilbert-Peano space-filling curve to traverse the quadtree and define a sequential order, the background quadtree is also used to guide the partitioning of the unstructured mesh. This allows us to partition unstructured meshes for both homogeneous and heterogeneous computing environments. Then, we propose a parallel implicit Euler solver based on a parallel symmetric approximate LU factorization iterative algorithm (ALU). We show that the number of substitution steps for the lower sweep or the upper sweep of the parallel ALU algorithm is four. Experimental studies for the NACA0012 airfoil, the NASA EET wing, and an artillery shell within a shock tube are reported.

1 Introduction

Over the past decade, a lot of effort has been dedicated to the field of computational fluid dynamics. In particular, the use of unstructured meshes, which are suitable for complex geometries, is widely accepted for solving the Euler/Navier-Stokes equations. In this paper, we focus on implementing an unstructured Euler/Navier-Stokes solver on distributed memory parallel computers (DMPCs). We propose a novel partitioning strategy for homogeneous and heterogeneous computing, and propose parallel algorithms to accelerate the computation. We then present experimental studies on a workstation cluster.

Parallel Euler/Navier-Stokes solvers for DMPCs have been proposed with different capabilities, such as using a Runge-Kutta scheme for solving the partial differential equations,1 dealing with structured meshes,2,3 and others. However, for huge computational problems, implicit or multigrid methods are necessary to propagate information across the computing domain and to accelerate the convergence rate. A complete survey of using various implicit methods for solving, in parallel environments, the sparse linear systems arising from the Euler/Navier-Stokes equations can be found elsewhere.4

Since the scale of the sparse linear system is often very large, it is not practical to use any direct method to solve it, as many non-zero entries may fill in the matrix, which results in an unacceptable computational complexity and the requirement of huge memory space. Therefore, solving the sparse linear system by an iterative implicit method is more suitable for the computation.

*This work was partially supported by the NSC under Grants NSC 89-2213-E-001-050 and NSC 89-2218-E-001-003.

†Research fellow. ‡Post-doctoral research fellow. §Associate research fellow.
Copyright © 2001 by the American Institute of Aeronautics and Astronautics, Inc. All rights reserved.

To implement the iterative implicit solver in a parallel environment, one must deal with data dependences among different processing elements (PEs). That is, the information propagating throughout the computing domain must be taken into consideration on the partition boundaries. Data dependences between different PEs prevent the tasks in these PEs from executing in parallel, which degrades the execution performance on DMPCs. A simple approach is to modify the linear system and eliminate the data dependences between different PEs;5 however, this might also influence the convergence behavior of the whole system, especially for subsonic flow, when a large number of processors is used.6

The parallel algorithm for the Euler solver has to handle both data distribution and computation decomposition. Both can be handled using domain decomposition techniques. The domain decomposition partitioning problem can be treated as a graph partitioning problem, which is NP-complete;7 hence optimal solutions are computationally intractable for large problems. However, several heuristic methods have been proposed to perform such partitioning, such as recursive coordinate bisection,8 recursive spectral bisection,9 geometry-based partitions,10 and other techniques.

We will study both a graph-based domain decomposition tool (METIS11) and a quadtree spatial-based domain decomposition tool; the latter is presented in this paper. We are interested in studying the load balance of the computation decomposition and the communication overhead arising from exchanging data among PEs using these two domain decomposition tools, respectively.

For the quadtree spatial-based partitioning, a specific space-filling curve that passes through every quadtree leaf implicitly defines a sequential order among the two-dimensional quadtree leaves. Partitioning of the quadtree leaves is then reduced to the partitioning of a linear sequence of weighted tasks, where tasks represent quadtree leaves and weights are the numbers of triangles falling within each quadtree leaf.


Among such curves, the Morton order and the Hilbert-Peano order space-filling curves are frequently used.12,13 Note that we will use a triangle or a cell interchangeably to represent an element of the unstructured mesh.

For dealing with parallel computation, each cell involves some local computation and some global computation. For the local computation of each cell in the unstructured mesh, we maintain an overlap region in each PE which stores remotely accessed data from logically neighboring PEs. (PE_i is logically neighboring w.r.t. PE_j if and only if some cells in partition i are neighbors of cells in partition j.) Then, the local computation of each cell in a PE can be done independently in that PE. For the global computation of solving sparse linear systems, we propose a parallel symmetric ALU iterative algorithm, in which each PE only needs to send/receive data to/from its logically neighboring PEs.

This paper includes the following three contributions. First, because the unstructured mesh is generated based on a background quadtree which specifies the density distribution among cells within the computing domain,14 it is natural to use a quadtree spatial-based domain decomposition tool to partition cells among PEs, and it can achieve a better load balance than a graph-based domain decomposition tool such as METIS. In addition, the graph-based partitioning tool, which can only partition unstructured meshes into partitions of similar sizes, is only suitable for a homogeneous computing environment in which each PE has the same computing power. In contrast, our quadtree spatial-based partitioning tool can decompose unstructured meshes into partitions according to weight parameters for different PEs. Thus, our method is especially suitable for a heterogeneous computing environment in which each PE may have different computing power, such as different workstations (or PCs) connected by a fast network. If all weights are equal, then our method reduces to the partitioning for homogeneous computing.

Second, after domain decomposition partitioning, when performing the parallel symmetric ALU algorithm, we found that non-adjacent partitions can be executed simultaneously.15 By reducing this problem to the chromatic number problem, for the partitioning of a 2D unstructured mesh the number of non-adjacent sets is four, which can be obtained by solving a four-color problem. Thus, the number of substitution steps for the parallel lower triangular linear system and for the parallel upper triangular linear system can each be reduced to four. Third, applying the parallel symmetric ALU algorithm, we can further overlap the computation time and the communication time.

The rest of this paper is organized as follows. Section 2 formulates the implicit Euler flow solver. Section 3 proposes our quadtree spatial-based partitioning for domain decomposition. Section 4 proposes our parallel algorithm. Section 5 presents experimental studies on a workstation cluster, and Section 6 gives some concluding remarks.

2 Implicit Euler flow solver

In the present work, we adopt a MUSCL-type, upwind-difference, finite-volume method to solve the Euler equations. Considering an unsteady, inviscid, two-dimensional fluid in a control volume, the integral conservation form of the Euler equations can be written as follows:

\frac{d}{dt}\int_{V} Q \, dV + \oint_{S} \mathbf{F}\cdot\mathbf{n} \, dS = 0,   (1)

where V is an arbitrary area segment of the flow domain in \mathbb{R}^2, with boundary S and outward unit normal vector n = (n_x, n_y), Q = (ρ, ρu, ρv, ρE)^T is the flow conservation state vector, F·n = (ρΦ, ρuΦ + p n_x, ρvΦ + p n_y, ρEΦ + pΦ_n)^T is the flux vector normal to the boundary S, ρ is the density, the vector v = (u, v) is the velocity vector, E is the specific total energy, p is the pressure, Φ = (v − v_b)·n, Φ_n = v·n, and v_b is the velocity vector of the cell's boundary.

By the use of the Newton method, Equation (1) can be discretized and formulated into an iterative implicit form that can be written as

\left[\frac{V}{\Delta t} I + \frac{\partial R}{\partial Q}\right]^{m+1} \Delta Q = -R^{s} - \frac{V}{\Delta t}\left(Q^{s} - Q^{m}\right),   (2)

or

\mathrm{LHS}\,\Delta Q = \mathrm{RHS},   (3)

where m and s are indices for the main- and sub-iteration steps, respectively, ΔQ = Q^{s+1} − Q^{s} is the increment of the conservation states, and R is the summation of the numerical flux on the cell's boundary. The numerical flux is calculated by Frink's scheme16 with Roe's approximate Riemann solver.17 The flux Jacobian ∂R/∂Q is approximated by the linearization method proposed by Barth.18 Because each cell of a two-dimensional triangular mesh has at most three neighboring cells, the corresponding sparse linear system can be treated as a four-stencil problem, in which each row or column of the sparse matrix has at most four non-zero entries. The successive conservation states Q^{s+1} can then be calculated by solving this sparse linear system.
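
A minimal sketch of this four-stencil structure follows, assuming a cell-adjacency list is available; it assembles the sparsity pattern of LHS with SciPy. The per-cell 4x4 blocks of the real solver are collapsed to scalar placeholders, and four_stencil_pattern is an illustrative name, not the authors' routine.

import scipy.sparse as sp

def four_stencil_pattern(neighbors):
    """neighbors[i] lists the (at most three) cells adjacent to cell i."""
    n = len(neighbors)
    lhs = sp.lil_matrix((n, n))
    for i, nbrs in enumerate(neighbors):
        lhs[i, i] = 1.0              # diagonal entry (the D block of cell i)
        for j in nbrs:
            lhs[i, j] = 1.0          # one entry per neighbor (an L or U block)
    return lhs.tocsr()               # at most four non-zeros per row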

Since the number of cells is normally very large, it is not practical to use any direct method to solve this sparse linear system, because many non-zero entries may fill in the matrix, which requires an unaffordably huge space in main memory. In order to implement the Euler flow solver on DMPCs, we adopt the ALU iterative algorithm19 because it converges faster than the Jacobi and Gauss-Seidel iterative algorithms and it has high parallelism.


Let LHS be (D + L + U), where D is a block diagonal matrix, and L and U are a block lower triangular matrix and a block upper triangular matrix with zero diagonal elements, respectively. Then, the sparse linear system can be written as follows:

(D + L + U)\,\Delta Q = \mathrm{RHS}.   (4)

A symmetric ALU scheme factorizes the left hand side of Equation (4) as follows. For each odd iteration, we solve

(D + L)\,D^{-1}(D + U)\,\Delta Q = \mathrm{RHS};   (5)

for each even iteration, we solve

(D + U)\,D^{-1}(D + L)\,\Delta Q = \mathrm{RHS}.   (6)

Then, a two-sweep inversion process can be formulated as follows. For each odd iteration, we perform a lower sweep and then an upper sweep, where

lower sweep:  (D + L)\,\Delta Q^{*} = \mathrm{RHS},   (7)
upper sweep:  (D + U)\,\Delta Q = \mathrm{RHS} - L\,\Delta Q^{*}.   (8)

For each even iteration, we perform an upper sweep and then a lower sweep, where

upper sweep:  (D + U)\,\Delta Q^{*} = \mathrm{RHS},   (9)
lower sweep:  (D + L)\,\Delta Q = \mathrm{RHS} - U\,\Delta Q^{*}.   (10)

The time complexity of each lower sweep or upper sweep is linear w.r.t. the number of cells in the unstructured mesh.
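
As a minimal sketch of the two-sweep process in Equations (5)-(10), the following Python code applies one symmetric ALU iteration to a scalar-entry sparse system with SciPy; the function name alu_iteration and the use of spsolve_triangular are illustrative assumptions, and the 4x4 blocks per cell used by the actual solver are collapsed to scalars.

import scipy.sparse as sp
from scipy.sparse.linalg import spsolve_triangular

def alu_iteration(lhs, rhs, odd):
    """One two-sweep symmetric ALU step; lhs = D + L + U in CSR form."""
    D = sp.diags(lhs.diagonal()).tocsr()      # diagonal part
    L = sp.tril(lhs, k=-1).tocsr()            # strictly lower part
    U = sp.triu(lhs, k=1).tocsr()             # strictly upper part
    if odd:
        dq_star = spsolve_triangular(D + L, rhs, lower=True)            # Eq. (7)
        dq = spsolve_triangular(D + U, rhs - L @ dq_star, lower=False)  # Eq. (8)
    else:
        dq_star = spsolve_triangular(D + U, rhs, lower=False)           # Eq. (9)
        dq = spsolve_triangular(D + L, rhs - U @ dq_star, lower=True)   # Eq. (10)
    return dq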

3 Quadtree Spatial-Based Partitioning

We use a background quadtree to guide both the generation of unstructured meshes and the partitioning of these unstructured meshes.

3.1 Delaunay meshes generated based on a background quadtree

Theoretically, using the Steiner point insertion strategy for Delaunay triangulation to generate unstructured meshes, the aspect ratio is smaller than 4.31, the area ratio is smaller than 3, and the edge ratio is smaller than 2. Here, the aspect ratio of a triangle T is the ratio R_T / r_T, where R_T is the radius of the smallest circle containing T (circumcircle) and r_T is the radius of the largest circle contained in T (inscribed circle).20 The best aspect ratio is 2, attained by the equilateral triangle.

The area ratio between two adjacent triangles is the ratio of the larger area to the smaller area of these two triangles. The best area ratio is 1, where two adjacent triangles have the same area. The edge ratio of a triangle is the ratio of the length of the longest edge to the length of the shortest edge. The best edge ratio is 1, in the case of an equilateral triangle. The aspect ratio, area ratio, and edge ratio of a triangulation mesh are the largest (worst) aspect ratio, area ratio, and edge ratio among its triangles, respectively.
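
A small sketch of these quality measures for a single triangle follows; the function name triangle_quality is an illustrative assumption, and the area ratio is left to the caller since it compares two adjacent triangles.

import math

def triangle_quality(a, b, c):
    """Aspect ratio, edge ratio, and area of a triangle with vertices a, b, c."""
    la, lb, lc = math.dist(b, c), math.dist(a, c), math.dist(a, b)
    s = 0.5 * (la + lb + lc)                                        # semi-perimeter
    area = math.sqrt(max(s * (s - la) * (s - lb) * (s - lc), 0.0))  # Heron's formula
    r_in = area / s                                                 # inscribed-circle radius
    r_circ = la * lb * lc / (4.0 * area)                            # circumcircle radius
    aspect_ratio = r_circ / r_in                  # equals 2 for an equilateral triangle
    edge_ratio = max(la, lb, lc) / min(la, lb, lc)
    return aspect_ratio, edge_ratio, area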

In order to improve mesh quality, our unstructured mesh is generated based on a background quadtree which specifies the density distribution among cells within the computing domain.14 Using the Bowyer-Watson Delaunay triangulation algorithm along with Steiner point insertion and local refinements, for the case of the NACA0012 airfoil we can generate a mesh consisting of 12046 nodes and 23791 cells, where the aspect ratio is 3.25, the area ratio is 2.06, the edge ratio is 1.93, and there is no obtuse triangle in the mesh. For the case of the multi-element NASA EET wing (Energy Efficient Transport wing), we can generate a mesh consisting of 13796 nodes and 27070 cells, where the aspect ratio is 3.37, the area ratio is 1.98, the edge ratio is 1.97, and there is no obtuse triangle in the mesh.

3.2 Hilbert-Peano space-filling curve

The background quadtree, which indicates a smooth change of the density distribution, can be used to guide the domain decomposition. First, we use a Hilbert-Peano space-filling curve to traverse every quadtree leaf and define a sequential order among all quadtree leaves. Fig. 1-(a) shows a Hilbert-Peano space-filling curve traversing a regular mesh; Fig. 1-(b) shows a Hilbert-Peano space-filling curve traversing the density quadtree of a blunt body. Each quadtree leaf is associated with a density rank. If the density rank is small, it represents a high density; if the density rank is large, it represents a low density. The density rank implicitly restricts the edge length of a triangle. Thus, a quadtree leaf with a small density rank includes more triangles than a leaf of the same area with a large density rank.


Fig. 1 A Hilbert-Peano space-filling curve traverses through (a) a regular mesh and (b) a density quadtree of a blunt body.
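
To illustrate how the curve linearizes the leaves, the sketch below computes the position of a leaf along a Hilbert-Peano curve from integer grid coordinates at a fixed refinement level; the name hilbert_index and the uniform-grid indexing are illustrative assumptions, since the actual tool traverses an adaptive quadtree.

def hilbert_index(order, x, y):
    """Position of cell (x, y) along a Hilbert-Peano curve on a 2**order grid."""
    n = 2 ** order
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                    # rotate/reflect the quadrant so that
            if rx == 1:                # the curve stays continuous
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Sorting leaf coordinates by hilbert_index yields the sequential order described above.
leaves = [(3, 7), (0, 0), (12, 5)]
leaves.sort(key=lambda c: hilbert_index(4, *c))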

Second, we identify each triangle as falling within a specific quadtree leaf according to the location of its gravity center. After that, we apply a bucket sort to all triangles according to the sequential order defined by the Hilbert-Peano space-filling curve. Now, partitioning of an unstructured mesh is reduced to the partitioning of a linear sequence of weighted tasks, where each task represents a quadtree leaf and its weight is the number of triangles falling within that quadtree leaf.

Additional refinement can be used to improve the quality of the partitioned mesh. We examine the background quadtree and swap cells, which adhere to the quadtree, between adjacent partitions to make the partition boundaries more regular. In order to reduce the communication overhead, we iteratively examine the boundaries between adjacent partitions and move cells to a neighboring partition if we find that the number of boundary edges can be reduced by the swap. Then we adjust the cells between adjacent partitions to balance the computing load, under the condition that the number of boundary edges does not increase after the adjustment.

3.3 Experimental studies of partitioning

It is instructive to give a quantitative comparison of the load balance and the communication overhead obtained using the graph-based partitioning tool METIS and our quadtree spatial-based partitioning tool.

3.3.1 Partitioning for homogeneous computing environment

Both METIS and our quadtree spatial-based partitioning tool can generate quite good partitionings for homogeneous computing, in which each PE has the same computing power. Consider the unstructured mesh of the NACA0012 airfoil, which includes 12046 nodes and 23791 cells. Fig. 2-(a) and -(b) show eight partitions generated by METIS and by our tool, respectively.

Table 1-(a) and -(b) summarize the statistics of adjacent cells between each pair of the eight partitions of the NACA0012 airfoil mesh generated by METIS and by our quadtree spatial-based partitioning tool, respectively. The value in entry (p_i, p_i) is the number of cells in partition p_i, and the value in entry (p_i, p_j), for i ≠ j, is the number of cells in partition p_j each of which shares at least one vertex with some cell in partition p_i. When the value in entry (p_i, p_j) is not zero, a communication message must be sent/received between partitions p_i and p_j.

Since the states of the fluid are basically defined on cells in the finite-volume method, the data communicated between partitions are determined by the number of cells adjacent to the partition boundary, rather than by the number of edges on the partition boundary. Therefore, we define the communication cost to be the sum of the values in entries (p_i, p_j) with i ≠ j. Referring to Table 1-(a), there are in total 34 off-diagonal non-zero entries whose sum is 2048, which means that 34 messages containing the data of 2048 cells must be sent/received to/from neighboring partitions.

In addition, we can see that the difference between the largest number of cells, 3055 at entry (p_6, p_6) in partition 6, and the smallest number of cells, 2902 at entry (p_2, p_2) in partition 2, is 153. This means that the load imbalance factor is 153, which is about 5.1% of the optimal number of cells per partition.

Fig. 2 Eight partitions of the NACA0012 airfoil generated by (a) METIS and (b) our quadtree spatial-based partitioning tool.

From Table 1-(b), we can see that there are in total 30 off-diagonal non-zero entries, whose sum is 2521. In addition, the load imbalance factor is 1, which is the best case for this mesh.

Two additional cases, the NASA EET wing and the artillery shell within a shock tube, are shown in Fig. 3 and Fig. 4. A detailed comparison of the partitioned meshes is listed in Table 2. The partitions generated by the METIS tool and by the quadtree spatial-based partitioning tool before/after the refinement are shown in the table.

In summary, our partitioning tool ensures a better load balance and a smaller number of messages sent/received among logically neighboring PEs, but with relatively large message sizes.


                                           NACA0012 airfoil         NASA EET wing            Artillery shell
                                           C     M   L              C     M   L              C     M   L
METIS                                      2059  17  153 (5.1%)     1976  15  198 (5.9%)     4211  12  935 (4.8%)
Quadtree spatial-based (no refinement)     3071  15  33 (1.1%)      3734  17  27 (0.8%)      7124  13  8 (0.04%)
Quadtree spatial-based (with refinement)   2521  15  1 (0.0%)       2798  14  1 (0.0%)       5878  13  1 (0.0%)

Table 2 The quality of partitions for the cases of the NACA0012 airfoil, the NASA EET wing, and the artillery shell within a shock tube. Eight partitions are decomposed homogeneously by METIS and by our quadtree spatial-based partitioning tool. (C, M, L) stands for (communication cost, number of messages, load imbalance factor).

Table 1 The numbers of adjacent cells between each pair of the eight partitions of the NACA0012 airfoil mesh generated by (a) METIS and (b) our quadtree spatial-based partitioning tool. The value in entry (p_i, p_i) is the number of cells in partition p_i, and the value in entry (p_i, p_j), for i ≠ j, is the number of cells in partition p_j each of which shares at least one vertex with some cell in partition p_i.

3.3.2 Partitioning for heterogeneous computing environment

METIS cannot, but our tool can, generate partitions with unequal weights. This unequal-weight partitioning is used for heterogeneous computing, in which different PEs may have different computing power, such as different workstations connected by a fast network. For instance, consider a heterogeneous computing environment with four Ultrasparc-1's and eight Ultrasparc-2's connected by a fast Ethernet, where the CPU clock of the Ultrasparc-1 is 169 MHz and the CPU clock of the Ultrasparc-2 is 300 MHz. Suppose that we need to decompose the mesh of the NACA0012 airfoil into 12 partitions; each of the first four partitions, for the four Ultrasparc-1's, has weight 1, and each of the remaining eight partitions, for the eight Ultrasparc-2's, has weight 2.
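
A minimal sketch of such an unequal-weight split of the Hilbert-ordered leaf sequence follows; split_weighted_sequence and its greedy prefix-sum rule are illustrative assumptions rather than the tool's actual heuristic, which additionally refines the partition boundaries.

from itertools import accumulate

def split_weighted_sequence(leaf_weights, pe_weights):
    """leaf_weights: triangles per leaf in Hilbert-Peano order;
    pe_weights: relative computing power of each PE.
    Returns the partition id assigned to every leaf."""
    total = sum(leaf_weights)
    cum = list(accumulate(pe_weights))
    targets = [total * c / cum[-1] for c in cum]   # cumulative work per partition end
    parts, p, work = [], 0, 0
    for w in leaf_weights:
        work += w
        parts.append(p)
        if work >= targets[p] and p < len(pe_weights) - 1:
            p += 1                                  # move on to the next PE
    return parts

pe_weights = [1] * 4 + [2] * 8     # four Ultrasparc-1's and eight Ultrasparc-2's
# parts = split_weighted_sequence(triangles_per_leaf, pe_weights)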

The results of heterogeneous partitioning for the NACA0012 airfoil and the NASA EET wing are shown in Fig. 5 and Fig. 6. The weight is 1 for partitions 1-4 and 2 for partitions 5-12. Table 3 summarizes the quality of the partitioned meshes. We find that our quadtree spatial-based partitioning tool balances the load very well for arbitrary partition weights, even without any refinement; the remaining load imbalance can then be easily eliminated by the refinement procedure.

Fig. 3 Eight partitions of the NASA EET wing generated by the quadtree spatial-based partitioning tool.

Fig. 4 Eight partitions of the artillery shell within a shock tube generated by the quadtree spatial-based partitioning tool.


                                           NACA0012 airfoil        NASA EET wing           Artillery shell
                                           C     M   L             C     M   L             C     M   L
Quadtree spatial-based (no refinement)     4277  24  22 (1.8%)     4652  30  12 (0.9%)     9801  26  5 (0.06%)
Quadtree spatial-based (with refinement)   3212  24  1 (0.0%)      3266  26  0 (0.0%)      7926  25  1 (0.0%)

Table 3 The quality of partitions for the cases of the NACA0012 airfoil, the NASA EET wing, and the artillery shell within a shock tube. Twelve partitions are decomposed heterogeneously by our quadtree spatial-based partitioning tool. The weight is 1 for partitions 1-4 and 2 for partitions 5-12. (C, M, L) stands for (communication cost, number of messages, load imbalance factor).

Fig. 5 Twelve partitions of the NACA0012 airfoil generated by the quadtree spatial-based partitioning tool. The weight is 1 for partitions 1-4 and 2 for partitions 5-12.

4 Parallel Euler solver

For dealing with parallel computation, each cell involves some local computation and some global computation. In order to maintain data locality, all data related to cells in partition i are stored in PE_i; these data include nodes, edges, areas, flow conservation state vectors, and flux vectors. Although computing the flow conservation state vector and flux vector of a cell only uses data in its neighboring cells, it is possible that some neighboring cells are stored in other, logically neighboring PEs. Thus, we maintain an overlap region for each PE, which stores remotely accessed data from logically neighboring PEs. Then, the local computation of each cell in a PE can be done independently in that PE.

4.1 Bounds for the ALU substitution steps

Fig. 6 Twelve partitions of the NASA EET wing generated by the quadtree spatial-based partitioning tool. The weight is 1 for partitions 1-4 and 2 for partitions 5-12.

For dealing with the global computation of solving sparse linear systems, we now present a parallel symmetric ALU iterative algorithm, in which each PE only needs to send/receive data to/from its logically neighboring PEs. First, we renumber the partitions generated by a domain decomposition tool as follows. Because the unstructured mesh is two-dimensional, after applying a domain decomposition tool to decompose the unstructured mesh into P partitions, the neighboring relationship among these P partitions can be represented by a planar graph. Because non-adjacent partitions can be executed simultaneously, each set of non-adjacent partitions represents a substitution step for the lower sweep or the upper sweep of the parallel symmetric ALU algorithm. The minimum number of substitution steps for the lower sweep or the upper sweep of the parallel symmetric ALU iterative algorithm is equal to the minimum number of independent sets that partition the nodes of the planar graph. This problem reduces to the chromatic number problem: what is the minimum number of colors needed to color the nodes of a graph such that adjacent nodes get different colors, where each color corresponds to one independent set.
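
As an illustration of forming the independent sets, the sketch below colors the partition-adjacency graph with networkx's greedy coloring; this is an assumption for demonstration only, since a greedy ordering can use more than four colors on some planar graphs, whereas the four-color argument below guarantees four sets.

import networkx as nx

def substitution_steps(adjacent_pairs, num_partitions):
    """adjacent_pairs: (i, j) pairs of partitions sharing a boundary."""
    g = nx.Graph()
    g.add_nodes_from(range(num_partitions))
    g.add_edges_from(adjacent_pairs)
    coloring = nx.coloring.greedy_color(g, strategy="smallest_last")
    steps = {}
    for part, color in coloring.items():
        steps.setdefault(color, []).append(part)   # partitions of one color run together
    return [sorted(steps[c]) for c in sorted(steps)]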

Since a clique of four nodes (K4) is a planar graph but a clique of five nodes (K5) is not, the maximum clique in a general planar graph has four nodes. Therefore, at least four colors are needed to color a general planar graph. Thus, the lower bound of the number of substitution steps for the lower sweep or the upper sweep in the parallel symmetric ALU iterative algorithm is four.

As for the upper bound, it is well known that any map can be colored with only four colors such that adjacent countries receive different colors. This is the well-known four-color theorem for a general planar graph. Thus, the upper bound of the number of substitution steps for the lower sweep or the upper sweep in the parallel ALU iterative algorithm is also four for the two-dimensional unstructured mesh, independent of the number of partitions.15 An illustration of the implicit parallel ALU iterative algorithm is shown in Fig. 7-(a), where local_i means the local computation at iteration i, and PEj represents the set of PEs corresponding to the j-th set of non-adjacent partitions. ALU_low and ALU_up are the lower sweep and the upper sweep of the symmetric ALU algorithm, respectively.

4.2 A parallel algorithm for DMPCs

It is worthwhile to mention that for consecutive iterations, the computation time and the communication overhead can be overlapped, as shown in Fig. 7-(b). The parallel algorithm for one iteration of the Euler flow solver consists of five steps. Suppose that there are P PEs.

Step 1: For each PE_i, if PE_i is logically neighboring w.r.t. PE_j, then PE_i sends the data of its cells that neighbor cells in partition j to PE_j, for 1 ≤ i, j ≤ P. PE_i also receives remotely accessed data from its logically neighboring PEs and stores them in the overlap region.

Step 2: For each PE_k, if cell i is in partition k, then PE_k computes the entries in LHS and RHS w.r.t. that cell i using local data and data in the overlap region, for 1 ≤ k ≤ P.

For each odd iteration, all PEs execute Step 3 and Step 4.

Step 3: The lower sweep in Equation (7) is solved as follows. For each PE_j, where 1 ≤ j ≤ P, after receiving ΔQ*'s from all of its logically neighboring PE_i with 1 ≤ i < j ≤ P, PE_j can compute its own ΔQ*'s. After that, PE_j sends its ΔQ*'s to all of its logically neighboring PE_k with 1 ≤ j < k ≤ P. (A message-passing sketch of this step is given after Step 5.)

Step 4: The upper sweep in Equation (8) is solved as follows. For each PE_j, where 1 ≤ j ≤ P, after receiving ΔQ's from all of its logically neighboring PE_k with 1 ≤ j < k ≤ P, PE_j can compute its own ΔQ's. After that, PE_j sends its ΔQ's to all of its logically neighboring PE_i with 1 ≤ i < j ≤ P.

For each even iteration, all PEs execute Step 3' and Step 4'.

Step 3': The upper sweep in Equation (9) is solved as follows. For each PE_j, where 1 ≤ j ≤ P, after receiving ΔQ*'s from all of its logically neighboring PE_k with 1 ≤ j < k ≤ P, PE_j can compute its own ΔQ*'s. After that, PE_j sends its ΔQ*'s to all of its logically neighboring PE_i with 1 ≤ i < j ≤ P.

Step 4': The lower sweep in Equation (10) is solved as follows. For each PE_j, where 1 ≤ j ≤ P, after receiving ΔQ's from all of its logically neighboring PE_i with 1 ≤ i < j ≤ P, PE_j can compute its own ΔQ's. After that, PE_j sends its ΔQ's to all of its logically neighboring PE_k with 1 ≤ j < k ≤ P.

Step 5: Each PE_i updates the flow conservation state vectors and other related data w.r.t. the cells in partition i, for 1 ≤ i ≤ P.
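
The following mpi4py sketch illustrates the message pattern of Step 3 for one PE; the helper solve_local_lower, the neighbor lists, and the use of mpi4py (rather than the MPICH/C setting of our experiments) are assumptions made for illustration only.

from mpi4py import MPI

def parallel_lower_sweep(comm, lower_nbrs, upper_nbrs, rhs, solve_local_lower):
    """lower_nbrs / upper_nbrs: ranks of logically neighboring PEs whose color
    class is swept before / after this PE's class in the four-color ordering."""
    # receive boundary dq_star values from partitions swept earlier
    halo = {nbr: comm.recv(source=nbr, tag=30) for nbr in lower_nbrs}
    # local forward substitution over the cells of this partition
    dq_star, boundary_dq_star = solve_local_lower(rhs, halo)
    # forward this partition's boundary dq_star to partitions swept later
    for nbr in upper_nbrs:
        comm.send(boundary_dq_star, dest=nbr, tag=30)
    return dq_star

if __name__ == "__main__":
    comm = MPI.COMM_WORLD    # each rank would call parallel_lower_sweep here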

4.3 Performance analysis

The computation of the parallel Euler flow solver includes some local computation for computing the flow conservation state vectors and flux vectors in each cell, and some global computation for solving sparse linear systems using the parallel ALU algorithm. Let the total execution time on a single PE be TOTAL = LOCAL + ALU, where TOTAL is the total execution time, LOCAL is the execution time for the local computation, and ALU is the execution time for solving the sparse linear systems using the ALU algorithm. Suppose that there are in total P PEs. The theoretical execution times and speedups for the ALU iterative algorithms are summarized in Table 4, where Comm_1, Comm_2, and Comm_3 are the communication overheads and the subscript P stands for the number of PEs.
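
A small numeric sketch of the model in Table 4 follows; the arguments are placeholders to be filled with measured LOCAL, ALU, and communication times.

def predicted_speedups(local, alu, P, comm1, comm2, comm3):
    """Evaluate the execution-time model of Table 4 and return the three speedups."""
    total = local + alu
    t_plain = local / P + alu + comm1              # without 4-color opt.
    t_color = local / P + 4.0 * alu / P + comm2    # with 4-color opt.
    t_both = local / P + 2.5 * alu / P + comm3     # with 4-color and overlapped opt.
    return total / t_plain, total / t_color, total / t_both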

5 Experimental studies on DMPCs

We use four dual-processor Ultrasparc-2 workstations connected by a 100-Mbps (megabits per second) fast Ethernet network as our experimental environment. Three test cases, two with steady solutions and one with an unsteady solution, are used to evaluate the performance of the solver. The first case is the NACA0012 airfoil with 23791 cells; the Mach number is 0.8 and the angle of attack is 3°. The second case is the three-element NASA EET airfoil with 26900 cells; the Mach number is 0.2 and the angle of attack is 20°. The last case is an artillery shell within a shock tube; the incoming shock moves from the left and the inflow Mach number is 1.2. In order to ensure that the solution converges within each time step, 10 sub-iterations are executed per time step.

The results for these test cases are shown in Fig. 8-(a), -(b), and -(c), respectively. The performance of the parallel iterative algorithms is summarized in Table 5, Table 6, and Table 7. Two sets of timings, one for the METIS-partitioned mesh and the other for the quadtree spatial-based partitioned mesh, are shown in each table.

The results show that the overlapped optimization is effective in improving the performance of the iterative algorithms; up to 30% performance improvement is achieved in our experiments. If we apply the four-color and overlapped optimizations at the same time, up to 52% performance improvement is achieved for the case of the artillery shell.

We are interested in the comparison between the METIS and quadtree spatial-based partitioned meshes. Because the communication time is a relatively small part of the total execution time, we expected that the performance gained from the better-balanced partitioned mesh would compensate for the performance lost in communication.


Fig. 7 Illustration of the parallel iterative algorithms: (a) non-overlapped ALU algorithm, (b) overlapped ALU algorithm. local_i means the local computation at iteration i; PEj represents the set of PEs corresponding to the j-th set of non-adjacent partitions; ALU_low and ALU_up are the lower sweep and the upper sweep of the ALU algorithm, respectively.

                                                            execution time                                 speedup
sequential time                                             TOTAL = LOCAL + ALU                            1
parallel execution time without 4-color opt.                TOTAL_P = (LOCAL/P) + ALU + Comm_1             TOTAL / TOTAL_P
parallel execution time with 4-color opt.                   TOTAL'_P = (LOCAL/P) + 4(ALU/P) + Comm_2       TOTAL / TOTAL'_P
parallel execution time with 4-color and overlapped opt.    TOTAL''_P = (LOCAL/P) + 2.5(ALU/P) + Comm_3    TOTAL / TOTAL''_P

Table 4 Sequential/parallel execution times with/without four-color and overlapped optimization for the ALU iterative algorithms, where LOCAL is the execution time for local computation on a single PE, ALU is the execution time for solving the sparse linear systems using the ALU algorithm on a single PE, and Comm_1, Comm_2, and Comm_3 are the communication overheads on P PEs.

However, the results show that the performance for the METIS-partitioned mesh is higher than that for the quadtree-partitioned mesh by about 3%. Since the increase in the total execution time is larger than the increase in communication overhead, this issue needs more investigation in the future.

We have also run the solver on a heterogeneous network consisting of eight Ultrasparc-2 workstations and three Ultrasparc-1 workstations. The ratio of processing speed between these two types of machines is about 2. We compare two cases: one partitions the mesh into 11 non-equal-sized partitions according to the speed ratio, and the other decomposes the mesh into 11 equal-sized partitions. The execution time for the NACA0012 airfoil is 0.87 seconds per iteration with heterogeneous partitioning, and 1.19 seconds per iteration with homogeneous partitioning.

This result demonstrates that taking heterogeneity into account improves the execution time of the solver by about 27%.

However, the performance of the solver on the heterogeneous network of eleven workstations (0.87 seconds) is inferior to that on a homogeneous system of eight Ultrasparc-2 workstations (0.75 seconds). We have observed that the communication time between two different types of workstations is much higher than that between homogeneous ones, an indication that MPICH 1.1.2 (the MPI implementation that we used for message passing between processing nodes) may not handle communication between heterogeneous workstations efficiently. We expect that the performance of the solver on the heterogeneous system will be much improved when a newer version of MPICH becomes available.


#PEs   no optimization        with overlapped opt.    with 4-color opt.       with 4-color and overlapped opt.
       time, comm, speedup    time, comm, speedup     time, comm, speedup     time, comm, speedup

Mesh decomposed by METIS partitioner
2      1.01, 0.122, 1.40      0.99, 0.105, 1.42       1.01, 0.118, 1.40       0.97, 0.103, 1.45
4      0.54, 0.086, 2.61      0.59, 0.062, 2.39       0.56, 0.093, 2.52       0.52, 0.063, 2.71
8      0.45, 0.071, 3.13      0.37, 0.059, 3.81       0.36, 0.090, 3.92       0.34, 0.056, 4.15

Mesh decomposed by quadtree partitioner
2      1.03, 0.129, 1.37      0.98, 0.117, 1.44       1.05, 0.134, 1.34       0.98, 0.107, 1.44
4      0.59, 0.088, 2.39      0.58, 0.059, 2.43       0.73, 0.101, 1.93       0.53, 0.062, 2.66
8      0.47, 0.089, 3.00      0.40, 0.067, 3.53       0.48, 0.098, 2.93       0.38, 0.061, 3.71

Table 5 Per-iteration times (total time, communication time, speedup) of the NACA0012 airfoil (23791 cells) on different numbers of processors. Uniprocessor time is 1.41 seconds.

#PEs   no optimization        with overlapped opt.    with 4-color opt.       with 4-color and overlapped opt.
       time, comm, speedup    time, comm, speedup     time, comm, speedup     time, comm, speedup

Mesh decomposed by METIS partitioner
2      1.03, 0.124, 1.53      1.00, 0.109, 1.58       1.02, 0.119, 1.55       0.99, 0.111, 1.60
4      0.61, 0.082, 2.59      0.58, 0.061, 2.72       0.63, 0.087, 2.51       0.56, 0.065, 2.82
8      0.47, 0.072, 3.36      0.36, 0.060, 4.39       0.40, 0.075, 3.95       0.35, 0.059, 4.51

Mesh decomposed by quadtree partitioner
2      1.10, 0.131, 1.44      1.02, 0.118, 1.55       1.12, 0.138, 1.41       1.00, 0.117, 1.58
4      0.68, 0.089, 2.32      0.63, 0.069, 2.51       0.70, 0.092, 2.26       0.60, 0.068, 2.63
8      0.52, 0.081, 3.04      0.39, 0.069, 4.05       0.44, 0.082, 3.59       0.37, 0.061, 4.27

Table 6 Per-iteration times (total time, communication time, speedup) of the NASA EET wing (26900 cells) on different numbers of processors. Uniprocessor time is 1.58 seconds.

#PEs   no optimization        with overlapped opt.    with 4-color opt.       with 4-color and overlapped opt.
       time, comm, speedup    time, comm, speedup     time, comm, speedup     time, comm, speedup

Mesh decomposed by METIS partitioner
2      62.03, 3.77, 1.43      54.90, 2.88, 1.62       61.00, 3.95, 1.46       56.50, 3.01, 1.58
4      36.89, 3.21, 2.41      31.21, 2.55, 2.85       30.09, 3.27, 2.96       29.88, 2.60, 3.00
8      27.11, 2.44, 3.28      23.04, 1.92, 3.86       18.56, 2.48, 4.80       17.70, 1.77, 5.03

Mesh decomposed by quadtree partitioner
2      62.51, 3.95, 1.42      55.50, 3.08, 1.60       61.38, 4.01, 1.45       56.60, 3.22, 1.57
4      37.60, 3.42, 2.37      31.71, 2.87, 2.81       37.30, 3.47, 2.39       30.74, 2.78, 2.90
8      28.12, 2.71, 3.17      23.86, 2.10, 3.73       19.42, 2.78, 4.58       18.42, 1.95, 4.83

Table 7 Per-iteration times (total time, communication time, speedup) of the artillery shell within a shock tube (156412 cells) on different numbers of processors. Uniprocessor time (including 10 sub-iterations) is 89.00 seconds.

6 Conclusion

We have presented in this paper a domain decomposition approach for designing a parallel implicit Euler solver on homogeneous and heterogeneous computing environments. The unstructured mesh which tessellates the computing domain is decomposed into partitions by a quadtree spatial-based partitioning method. We use the Hilbert-Peano space-filling curve to define an order for the quadtree leaves and then transform the mesh partitioning problem into the partitioning of a sequence of weighted tasks. This allows us to partition unstructured meshes for both homogeneous and heterogeneous computing environments.

The parallel implicit Euler solver includes some local computation and some global computation. For the local computation, we maintain an overlap region to store remotely accessed data, so that all local computations can be done independently in each PE. For the global computation, dealing with the symmetric ALU, we show that the number of substitution steps for the lower sweep and the upper sweep of the parallel ALU is four. We have run cases for the NACA0012 airfoil and the NASA EET wing on a workstation cluster. Experimental studies show that our domain decomposition approach is promising.

References

1 Satofuka, N., Obata, M., and Suzuki, T., "Parallel Computation of Super-/Hypersonic Flows on Workstation Network and Transputer Arrays," Parallel Computing, Vol. 23, 1997, pp. 1293-1305.

2 Averbuch, A., Ioffe, L., Israeli, M., and Vozovoi, L., "Two-Dimensional Parallel Solver for the Solution of Navier-Stokes Equations with Constant and Variable Coefficients Using ADI on Cells," Parallel Computing, Vol. 24, 1998, pp. 673-699.

3 di Serafino, D., "A Parallel Implementation of a Multigrid Multiblock Euler Solver on Distributed Memory Machines," Parallel Computing, Vol. 23, 1997, pp. 2095-2113.

4 Venkatakrishnan, V., "Implicit Scheme and Parallel Computing in Unstructured Grid CFD," ICASE Report 95-28, Inst. for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, VA, 1995.

5 Venkatakrishnan, V., "Parallel Implicit Unstructured Grid Euler Solvers," AIAA Journal, Vol. 32, No. 10, Oct. 1994, pp. 1985-1991.

6 Jensen, C. B., "Implicit Multiblock Euler and Navier-Stokes Calculations," AIAA Journal, Vol. 32, No. 9, 1994, pp. 1808-1814.

7 Garey, M. R. and Johnson, D. S., Computers and Intractability, W. H. Freeman and Co., San Francisco, 1979.


Fig. 8 Computation results of the present work: (a) density contours for the NACA0012 airfoil, where the Mach number and angle of attack of the incoming flow are 0.8 and 3°; (b) Mach contours for the NASA EET wing, where the Mach number and angle of attack are 0.2 and 20°; (c) density contours for the artillery shell within a shock tube, where the Mach number of the incoming flow is 1.2.

8 Farhat, C. and Lesoinne, M., "Automatic Partitioning of Unstructured Meshes for the Parallel Solution of Problems in Computational Mechanics," International Journal for Numerical Methods in Engineering, Vol. 36, 1993, pp. 745-764.

9 Pothen, A., Simon, H. D., and Liou, K. P., "Partitioning Sparse Matrices with Eigenvectors of Graphs," SIAM J. Matrix Anal. Appl., Vol. 11, 1990, pp. 430-452.

10 Miller, G. L., Teng, S.-H., Thurston, W., and Vavasis, S. A., "Automatic Mesh Partitioning," Graph Theory and Sparse Matrix Computation, Vol. 56 of The IMA Volumes in Mathematics and its Applications, Springer-Verlag, 1993, pp. 57-84.

11 Karypis, G. and Kumar, V., "METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0," Dept. of Computer Science, Univ. of Minnesota, Sep. 1998.

12 Ou, C. W., Ranka, S., and Fox, G., "Fast and Parallel Mapping Algorithms for Irregular Problems," The Journal of Supercomputing, Vol. 10, 1996, pp. 119-140.


13 Warren, M. S. and Salmon, J. K., "A Parallel Hashed Oct-Tree N-Body Algorithm," Proc. Supercomputing '93, 1993.

14 Lee, P.-Z. and Chang, C.-H., "Unstructured Mesh Generation Using Automatic Point Insertion and Local Refinement," Proc. National Computer Symposium, Taipei, Taiwan, Dec. 1999, pp. B550-B557.

15 Lee, P.-Z., Chang, C.-H., and Chao, M.-J., "A Parallel Euler Solver on Unstructured Meshes," Proc. ISCA 13th International Conference on Parallel and Distributed Computing Systems, Las Vegas, Nevada, Aug. 2000, pp. 171-177.

16 Frink, N. T., "Tetrahedral Unstructured Navier-Stokes Method for Turbulent Flow," AIAA Journal, Vol. 36, No. 11, Nov. 1998, pp. 1975-1982.

17 Hirsch, C., Numerical Computation of Internal and External Flows, Volume 2: Computational Methods for Inviscid and Viscous Flows, John Wiley & Sons, Inc., 1988.

18 Barth, T. J., "Analysis of Implicit Local Linearization Techniques for Upwind and TVD Algorithms," AIAA Paper 87-0595, American Institute of Aeronautics and Astronautics, 1987.

19 Pan, D. and Cheng, J. C., "Upwind Finite-Volume Navier-Stokes Computations on Unstructured Triangular Meshes," AIAA Journal, Vol. 31, No. 9, Sep. 1993, pp. 1618-1625.

20 Teng, S.-H. and Wong, C. W., "Unstructured Mesh Generation: Theory, Practice, and Perspectives," International Journal of Computational Geometry & Applications, 1999, to appear.
