
FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG
INSTITUT FÜR INFORMATIK (MATHEMATISCHE MASCHINEN UND DATENVERARBEITUNG)

Lehrstuhl für Informatik 10 (Systemsimulation)

Solving Finite Element Systems with Hypre and Z88

Stephan Helou

Bachelor Thesis


Solving Finite Element Systems with Hypre and Z88

Stephan Helou
Bachelor Thesis

Thesis supervisors: Prof. Dr. U. Rüde, Prof. Dr.-Ing. F. Rieg

Advisor: Dipl.-Inf. T. Gradl

Working period: 23.08.2009 – 23.11.2009


Declaration:

I affirm that I produced this thesis without outside help and without using sources other than those stated, that the thesis has not been submitted in the same or a similar form to any other examination authority, and that it has not been accepted by such an authority as part of an examination. All passages adopted verbatim or in substance from other sources are marked as such.

Erlangen, 10 March 2010 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


Abstract

Over time, problems in linear elasticity and the finite element method have grown ever larger and more complex. The aim of this thesis is to test different parallel solvers and preconditioners and to compare the achieved results with the algorithms of the finite element program Z88. This is done by calculating the displacements of different component parts with more than 2 million degrees of freedom. The hypre framework provides various highly parallel solvers and preconditioners, for instance BoomerAMG, Conjugate Gradient and ParaSails. The methods are compared with respect to their solving time, number of iterations and residuals. The parallel Conjugate Gradient algorithm with ParaSails as a preconditioner is the fastest combination in this thesis: with only 8 CPUs it is up to 110 seconds faster than the Z88 methods, and with more CPUs it calculates the problems up to 140 seconds faster. Solving the biggest component part with 243 CPUs is even 22.3 minutes faster. Given these results and the increasingly complex problems, the use of parallel solvers is indispensable.



Acknowledgements

I would like to thank Prof. Dr. Ulrich Rüde and Prof. Dr.-Ing. Frank Rieg from the University of Bayreuth for offering me this thesis. Very special thanks go to my supervisor Tobias Gradl for his overall support and advice. Special thanks also to my contact person in Bayreuth, Martin Neidnicht, who helped me with issues concerning Z88 and provided me with the data of the component parts. Additionally, I would like to thank Tobias Preclik, Dr. Harald Köstler and Klaus Iglberger for their overall support.


Contents

1 Introduction
2 Theoretical Background
   2.1 Linear Elasticity
   2.2 Finite Element Method
   2.3 Solvers
      2.3.1 The Classical Algebraic Multigrid
      2.3.2 The Conjugate Gradient Method
      2.3.3 ParaSails Preconditioner
3 The Frameworks
   3.1 The Hypre Package
      3.1.1 Interfaces
      3.1.2 Solvers and Preconditioners
   3.2 Z88
4 Computer Architecture
5 Results
   5.1 Test Problems
   5.2 Measurement Methods
   5.3 Z88
   5.4 BoomerAMG
   5.5 Conjugate Gradient
   5.6 Conclusions
6 Conclusion
Bibliography


List of Figures

2.1 Deformation caused by external forces in 2D.
2.2 Types of grids.
2.3 Basis functions in 1D.
2.4 Example for the assembling routine.
2.5 Types of cycles.
2.6 Example for the coarsening strategy.
2.7 Convergence of CG.
3.1 Overview of the different interfaces.
3.2 Overview of the various solvers and preconditioners.
3.3 Z88COM.
4.1 Compute node of the Woodcrest Cluster.
5.1 L section.
5.2 Piston.
5.3 Fan.
5.4 Hub carrier.
5.5 Arch.
5.6 Connecting rod.
5.7 I beam.
5.8 Residual for solving the L section with Z88I2.
5.9 Residual for solving the L section with BoomerAMG.
5.10 Residual for solving the L section with CG.
5.11 Residual for solving the L section with CG and using Euclid, BoomerAMG or ParaSails as a preconditioner.
5.12 Times and ratios for solving the component parts with CG+ParaSails and Z88I2.
5.13 Ratios for solving the fan with multiple CPUs.
5.14 Ratios for solving the I beam with multiple CPUs.
6.1 Deformation of the fan.


List of Tables

5.1 Times for calculating the displacements of the component parts with Z88.
5.2 Times for calculating the displacements of the component parts with BoomerAMG.
5.3 Times for calculating the displacements of the component parts with CG.
5.4 Times for calculating the displacements of the component parts with CG and a preconditioner.
5.5 Times and speedup for multiple CPUs.


List of Algorithms

1 Multigrid MG(u_l, f_l, l)
2 Algebraic Multigrid Coarsening
3 Conjugate Gradient Method
4 ParaSails
5 Building an IJ Matrix
6 Building an IJ Vector
7 Setup and run routine for the solvers
8 Setup and run routine for the solvers using a preconditioner
9 Time measurement of Z88F
10 Time measurement of Z88I2
11 Time measurement of hypre solvers


Nomenclature

C-points: points on the coarse grid
F_i^s: F-points strongly connected to i
H^1: Sobolev space
N_i^w: all points weakly connected to i
S_i: set of points that strongly influence i
AMG: algebraic multigrid
CG: conjugate gradient
FEM: finite element method
LE: linear elasticity
MG: multigrid


1 Introduction

Linear elasticity (LE) analyses have become indispensable in modern engineering. The calculation of stresses, strains and displacements in the mechanical components of a car is only one among the many application fields of elasticity theory. All these problems can be described by partial differential equations (PDEs).

One of the most widely used discretization methods over the last 50 years has been the finite element method (FEM) [JL01]. The basic idea of the method was formulated by the engineers Turner, Clough, Martin and Topp: they described the approach of separating a solid body into a finite number of elements and calculating the displacements at the elements' nodes. The theoretical proof was given by mathematicians in the 1960s. Since FEM problems are not easy to solve analytically, the method gained wide acceptance with the appearance of the first digital computers. One of the first applications was the calculation of artillery trajectories in World War II, carried out with the help of the Zuse Z3 and the Harvard Mark I, among the first program-controlled computers [RH09]. At that time only people with access to a mainframe could deal with FE calculations.

Today, nearly everybody can handle FE problems on an ordinary desktop PC. The increasing power of software and hardware over the last years has made it possible to calculate ever larger and more complicated problems. The FEM covers a broad spectrum of problems in different technical application fields, for instance weather forecasting, medical engineering, and the classical fields of aircraft and vehicle construction. Furthermore, it forms the basis of many CAD programs.

More complex local geometries, higher numbers of degrees of freedom and highly refined finite element meshes all increase the computational effort considerably. This can be handled with the aid of parallel computers: using such parallel systems in combination with more powerful solution methods minimizes the computational cost, so that the equations can be solved quickly and efficiently.

This thesis arose from a cooperation with Prof. Dr.-Ing. F. Rieg from the University of Bayreuth, who is the editor and designer of the finite element program Z88. Except for the PARDISO algorithm (in version 13), no parallel solver has yet been included in the Z88 framework. Hence, the aim of the thesis was to test different parallel solvers and preconditioners and to compare them with the algorithms included in Z88 (version 12). In all cases, displacements caused by surface loads, pressure loads or external forces were calculated. Within the scope of this thesis the highly parallel hypre framework was used, which provides a variety of parallel solvers and preconditioners. The main focus lies on comparing the computation times of the methods of the two frameworks.

The thesis is structured in the following way. Chapter 2 gives an overview of the theoretical background: it starts with a short elucidation of LE, then describes the FEM, and finally explains the main solvers and preconditioners used in the thesis: algebraic multigrid (AMG), conjugate gradient (CG) and ParaSails. Chapter 3 takes a closer look at the two frameworks: the free hypre library, developed at Lawrence Livermore National Laboratory [HYP09] for use on massively parallel computers, and Z88, a free finite element program developed at the University of Bayreuth [Z8809]. Chapter 4 gives a short overview of the computer architecture on which the problems were computed, the Woodcrest Cluster at the Regional Computing Center of Erlangen (RRZE). Chapter 5 presents the test components and the results achieved with the various methods. Finally, Chapter 6 concludes the thesis.


2 Theoretical Background

2.1 Linear Elasticity

This section only gives a brief insight into LE; for a deeper understanding refer to [JL01].

First, displacements are elucidated. Due to heat generation or applied forces, an undeformed body (Figure 2.1(a)) changes its shape into a deformed body (Figure 2.1(b)).

(a) undeformed body (b) deformed body

Figure 2.1: Deformation caused by external forces in 2D.

There are two different ways to describe the deformation in 3D. The Lagrange description focuses on the movement of a particle X. The displacement u and location x are given by

\[
x = x(X), \quad u = u(X), \qquad \text{alternatively} \qquad x_i = x_i(X_j), \;\; u_i = u_i(X_j). \tag{2.1}
\]

The Euler description focuses on the state of a point in space. The particle's location and displacement are given by

\[
X = X(x), \quad u = u(x), \qquad \text{alternatively} \qquad X_i = X_i(x_j), \;\; u_i = u_i(x_j), \tag{2.2}
\]

where i, j stand for the coordinate axes x, y, z. The displacements u are characterized with the help of the displacement gradient

\[
H_{ij} = u_{ij} = \tfrac{1}{2}(u_{ij} + u_{ji}) + \tfrac{1}{2}(u_{ij} - u_{ji}) = \varepsilon_{ij} + \omega_{ij}, \tag{2.3}
\]

where the infinitesimal rotation tensor ω_ij can be neglected, since the rotations have no effect on the element's stress. The first term ε_ij in equation (2.3) is the infinitesimal strain tensor

\[
\varepsilon_{ij} = \tfrac{1}{2}(u_{ij} + u_{ji}), \qquad i, j \in \{x, y, z\}. \tag{2.4}
\]

The elements of the tensor are symmetric, i.e. εij = εji, and have the matrix notation

\[
\varepsilon = \begin{pmatrix}
\varepsilon_{xx} & \varepsilon_{xy} & \varepsilon_{xz} \\
\varepsilon_{yx} & \varepsilon_{yy} & \varepsilon_{yz} \\
\varepsilon_{zx} & \varepsilon_{zy} & \varepsilon_{zz}
\end{pmatrix}. \tag{2.5}
\]


Another important quantity in LE is the stress tensor

\[
\sigma = [\sigma_{ij}] = \begin{pmatrix}
\sigma_{xx} & \sigma_{xy} & \sigma_{xz} \\
\sigma_{yx} & \sigma_{yy} & \sigma_{yz} \\
\sigma_{zx} & \sigma_{zy} & \sigma_{zz}
\end{pmatrix} = \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \sigma_{13} \\
\sigma_{21} & \sigma_{22} & \sigma_{23} \\
\sigma_{31} & \sigma_{32} & \sigma_{33}
\end{pmatrix}. \tag{2.6}
\]

In the case i = j the components σ_ij act normal to the surface and are called normal stresses. If i ≠ j the components σ_ij act tangentially and are called shear stresses.

With the aid of Hooke’s Law it is possible to combine strains and stresses to

\[
\sigma_{ij} = \lambda\,\varepsilon_{kk}\,\delta_{ij} + 2\mu\,\varepsilon_{ij}, \qquad i, j \in \{x, y, z\}, \tag{2.7}
\]

where the Kronecker delta δij is defined as

\[
\delta_{ij} = \begin{cases}
1 & \text{for } i = j, \\
0 & \text{for } i \neq j,
\end{cases} \qquad i, j = 0, 1, \ldots, n. \tag{2.8}
\]

The parameters µ and λ are the Lamé parameters, defined as

\[
\lambda = \frac{\nu E}{(1+\nu)(1-2\nu)}, \qquad \mu = \frac{E}{2(1+\nu)}, \tag{2.9}
\]

where ν (Poisson's ratio) and E (Young's modulus) are material constants.
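As a quick illustration (with textbook values for structural steel that are not taken from the thesis, E = 210 GPa and ν = 0.3), equation (2.9) yields

\[
\lambda = \frac{0.3 \cdot 210\,\mathrm{GPa}}{(1 + 0.3)(1 - 0.6)} = \frac{63\,\mathrm{GPa}}{0.52} \approx 121.2\,\mathrm{GPa},
\qquad
\mu = \frac{210\,\mathrm{GPa}}{2 \cdot 1.3} \approx 80.8\,\mathrm{GPa}.
\]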

In the scope of this thesis only the displacements are considered and calculated.

2.2 Finite Element Method

One of the standard approaches for solving LE problems is the FEM. Using the fact that the solution does not have to fulfill the partial differential equation (PDE) in every point, the basis of the method is to create the weak or variational formulation (cf. [JL01]).

In the following, the transformation from the classical to the variational formulation is shown using the 1D heat equation. With a given temperature at the left boundary and free heat exchange at the right boundary, the problem is defined as

\[
\begin{aligned}
-u''(x) + c\,u(x) &= f(x) \quad \forall x \in \Omega = (a,b), \quad u(x), f(x) \in \mathbb{R},\\
u(a) &= g_a,\\
-u'(b) &= \alpha_b\,\bigl(u(b) - u_b\bigr).
\end{aligned} \tag{2.10}
\]

Multiplying the equation with a test function v, contained in the Sobolev space H^1(a,b) (cf. [JL01]), leads to

\[
-u''(x)\,v(x) + c\,u(x)\,v(x) = f(x)\,v(x). \tag{2.11}
\]

Using integration by parts and inserting the given boundary conditions results in

\[
-\int_a^b u''(x)\,v(x)\,dx = \int_a^b u'(x)\,v'(x)\,dx + \alpha_b\,u(b)\,v(b) - \alpha_b\,u_b\,v(b). \tag{2.12}
\]

This is done in order to obtain a symmetric stiffness matrix. Furthermore, the problem gains in accuracy, since the integration by parts reduces the order of the derivatives that have to be approximated.


With the help of equations (2.11) and (2.12), the variational formulation reads:

find $u \in V_g = \{ u \in H^1(a,b) : u(a) = g_a \}$ such that
\[
a(u,v) = \langle F, v \rangle \quad \forall\, v \in V_0 = \{ v \in H^1(a,b) : v(a) = 0 \},
\]
where
\[
a(u,v) = \int_a^b \bigl[ u'(x)\,v'(x) + c\,u(x)\,v(x) \bigr]\,dx + \alpha_b\,u(b)\,v(b),
\qquad
\langle F, v \rangle = \int_a^b f(x)\,v(x)\,dx + \alpha_b\,u_b\,v(b). \tag{2.13}
\]

In the first step of solving the weak formulation, the domain is discretized by splitting it into different elements. The discretization can be structured (Figure 2.2(a)) or unstructured (Figure 2.2(b)). The mesh is generated with the aid of different fundamental geometries. The most commonly used ones are

• 1D → interval,

• 2D → triangle, quadrangle,

• 3D → tetrahedron, pyramid, hexahedron.

(a) structured (b) unstructured

Figure 2.2: Types of grids.

Another important ingredient of the FEM are the basis functions, which approximate the solution at a finite number of points. These functions have to satisfy the following conditions:

• the function has to be defined on the entire element,

• every function has to be assigned to one node of the element,

• at these nodes equation (2.8) has to be fulfilled,

• the sum of the approximation functions on an element has to be 1,

• the approximation functions of elements sharing a common edge or surface have to coincide at the shared nodes.

In the 1D case, the most commonly used basis functions are linear (Figure 2.3(a)), quadratic (Figure 2.3(b)) and cubic (Figure 2.3(c)); in the 2D case they are linear, bilinear, biquadratic and bicubic.
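For illustration (a standard example, not spelled out in the original text), the linear basis function p_i belonging to an interior node x_i of a 1D mesh is the hat function

\[
p_i(x) =
\begin{cases}
\dfrac{x - x_{i-1}}{x_i - x_{i-1}}, & x \in [x_{i-1}, x_i],\\[6pt]
\dfrac{x_{i+1} - x}{x_{i+1} - x_i}, & x \in [x_i, x_{i+1}],\\[6pt]
0, & \text{otherwise},
\end{cases}
\]

which satisfies the conditions above: p_i(x_j) = δ_ij as required by equation (2.8), and the functions sum to 1 on every element.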


(a) linear (b) quadratic (c) cubic

Figure 2.3: Basis functions in 1D.

Combining equation (2.13) with the basis functions results in: find
\[
u_h \in V_{gh} = \Bigl\{ u_h(x) : u_h(x) = \sum_{j=1}^{n} u_j\,p_j(x) + g_a\,p_0(x) \Bigr\} \tag{2.14}
\]
such that
\[
a(u_h, v_h) = \langle F, v_h \rangle \quad \forall\, v_h \in V_{0h} = \Bigl\{ v_h(x) : v_h(x) = \sum_{i=1}^{n} v_i\,p_i(x) \Bigr\}. \tag{2.15}
\]

Now it is possible to calculate the local stiffness matrices. After discretizing the domain with the help of n fundamental geometries, the left-hand side of equation (2.15) is computed as

\[
\sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} \bigl[ u_h'(x)\,v_h'(x) + c\,u_h(x)\,v_h(x) \bigr]\,dx. \tag{2.16}
\]

Regrouping the equation results in
\[
\sum_{i=1}^{n} \begin{pmatrix} v_{i-1} & v_i \end{pmatrix} K^{(i)} \begin{pmatrix} u_{i-1} \\ u_i \end{pmatrix}, \tag{2.17}
\]

where
\[
K^{(i)} = \begin{pmatrix} K^{(i)}_{11} & K^{(i)}_{12} \\ K^{(i)}_{21} & K^{(i)}_{22} \end{pmatrix} \tag{2.18}
\]
is the local stiffness matrix of one of the fundamental geometries. K^{(i)} has the following composition:

\[
\begin{aligned}
K^{(i)}_{11} &= \int_{x_{i-1}}^{x_i} \bigl[ p'_{i-1}(x)\,p'_{i-1}(x) + c\,p_{i-1}(x)\,p_{i-1}(x) \bigr]\,dx,\\
K^{(i)}_{12} &= \int_{x_{i-1}}^{x_i} \bigl[ p'_{i}(x)\,p'_{i-1}(x) + c\,p_{i}(x)\,p_{i-1}(x) \bigr]\,dx,\\
K^{(i)}_{21} &= \int_{x_{i-1}}^{x_i} \bigl[ p'_{i-1}(x)\,p'_{i}(x) + c\,p_{i-1}(x)\,p_{i}(x) \bigr]\,dx,\\
K^{(i)}_{22} &= \int_{x_{i-1}}^{x_i} \bigl[ p'_{i}(x)\,p'_{i}(x) + c\,p_{i}(x)\,p_{i}(x) \bigr]\,dx.
\end{aligned} \tag{2.19}
\]


Applying the same concept to the right hand side yields

\[
\sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} f(x)\,v_h(x)\,dx = \sum_{i=1}^{n} \begin{pmatrix} v_{i-1} & v_i \end{pmatrix} f^{(i)} \tag{2.20}
\]

with

\[
f^{(i)} = \begin{pmatrix} f^{(i)}_1 \\ f^{(i)}_2 \end{pmatrix} = \begin{pmatrix} \int_{x_{i-1}}^{x_i} f(x)\,p_{i-1}(x)\,dx \\ \int_{x_{i-1}}^{x_i} f(x)\,p_i(x)\,dx \end{pmatrix}. \tag{2.21}
\]

Assembling the local stiffness matrices into the global stiffness matrix $A \in \mathbb{R}^{n \times n}$ leads to the linear system
\[
A u = f. \tag{2.22}
\]

The assembly routine is illustrated with a small example. For the problem given in Figure 2.4, the global stiffness matrix is

\[
A = \begin{pmatrix}
K^1_{11} & K^1_{12} & & \\
K^1_{21} & K^1_{22} + K^2_{22} & K^2_{23} & \\
& K^2_{32} & K^2_{33} + K^3_{33} & K^3_{34} \\
& & K^3_{43} & K^3_{44}
\end{pmatrix}, \tag{2.23}
\]

where e.g. the local stiffness matrix of element 1 is

\[
K^1 = \begin{pmatrix} K^1_{11} & K^1_{12} \\ K^1_{21} & K^1_{22} \end{pmatrix}. \tag{2.24}
\]

Combining nodes 1 and 2, the element matrix entries defined by equation (2.24) are added to the global matrix in the following way: K^1_11 is added at position A(1,1), K^1_12 at A(1,2), K^1_21 at A(2,1) and K^1_22 at A(2,2).

Figure 2.4: Example for the assembling routine.

2.3 Solvers

In this section, the iterative methods and preconditioners used for solving equation (2.22) are explained. First, the focus lies on the classical AMG; this method only requires the underlying matrix, and no knowledge of the geometry of the problem is needed. Then the section looks at another well-known solver, the CG method. Finally, the ParaSails preconditioner, a parallel sparse approximate inverse preconditioner, is elucidated.

2.3.1 The Classical Algebraic Multigrid

The basic modules of every multigrid (MG) method are the pre-smoothing, post-smoothing, restriction, prolongation and correction steps. The following elucidation is mainly based on [WBM99] and [Fal06].


Algorithm 1 Multigrid MG(u_l, f_l, l)

1  if l = 1 then
2      MG(u_l, f_l, l) = A_l^{-1} f_l
3  end if
4  if l > 1 then
5      // ν1 pre-smoothing
6      u_l = R_{l,f_l}^{ν1}(u_l)
7      // coarse grid correction
8      residual:    r_l = f_l - A_l u_l
9      restriction: r_{l-1} = I_l^{l-1} r_l
10     recursive call:
11         e_{l-1}^0 = 0
12         for i = 1 to µ do
13             e_{l-1}^i = MG(e_{l-1}^{i-1}, r_{l-1}, l-1)
14         end for
15         e_{l-1} = e_{l-1}^µ
16     prolongation: e_l = I_{l-1}^l e_{l-1}
17     correction:   u_l = u_l + e_l
18     // ν2 post-smoothing
19     MG(u_l, f_l, l) = R_{l,f_l}^{ν2}(u_l)
20 end if

Algorithm 1 shows the standard multigrid algorithm, where l denotes the current level, I_l^{l-1} is the restriction operator, I_{l-1}^l is the prolongation (interpolation) operator and R is the smoothing operator. AMG does not need any geometry, but in order to keep the explanation as simple as possible it is nevertheless elucidated with the help of geometric grids.

As mentioned before, AMG is based on the same concept as classical MG: it recursively removes the smooth error remaining after relaxation with the help of a coarse grid correction. Both methods are recursive, controlled by the parameter µ. For µ = 1 a V-cycle is executed; Figure 2.5(a) shows a 5-level V-cycle. For µ = 2 a W-cycle is performed; Figure 2.5(b) depicts a 4-level W-cycle.

(a) V-cycle (b) W-cycle

Figure 2.5: Types of cycles.

Nevertheless, there are some differences between classical MG and AMG. In classical MG the high-


frequency geometric error is damped with the aid of the smoothing operator; the remaining low-frequency error (smooth error), which is eliminated by the coarse grid correction, is smooth in the usual geometric sense. In AMG, where linear systems are solved on the basis of MG principles, the remaining smooth error can be geometrically oscillatory (cf. [Fal06]).

In AMG, one of the crucial points is the definition of the coarse grid operator. A good coarsening strategy requires knowledge of the nature of the smooth error, which underlies the main components of the method. Since only the matrix A is known, but not the geometry that led to its structure, the error has to be characterized algebraically. Because iterative methods such as Gauss-Seidel or Jacobi damp high-frequency components well, they are often used as relaxation schemes. The smooth error corresponds to the eigenvectors of A with small eigenvalues: since the smoothers eliminate the components belonging to large eigenvalues, the coarse grid correction has to reduce those belonging to small ones.

To select the coarse grid, the fine grid has to be split into C-points (contained in both the fine and the coarse grid) and F-points (contained only in the fine grid). An important notion for this operation is the "strength of connection" between two neighboring entries in the matrix A, measured with a threshold Θ. Using

\[
-a_{ij} \geq \Theta \max_{k \neq i} (-a_{ik}) \qquad \text{with } 0 < \Theta \leq 1, \tag{2.25}
\]

the coarse grid selection can be summarized in two heuristics (cf. [WBM99]):

H-1: For each F-point i, every point j ∈ Si that strongly influences i either should be in thecoarse interpolatory set Ci or should strongly depend on at least one point in Ci.

H-2: The set of coarse points C should be a maximal subset of all points with the propertythat no C-point strongly depends on another C-point.
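As a sketch of criterion (2.25) (written for this text, not taken from the hypre sources), a test whether point j strongly influences point i in a densely stored row of A could look like this:

/* Return 1 if j strongly influences i according to (2.25):
 *   -a_ij >= theta * max_{k != i} (-a_ik),  with 0 < theta <= 1.
 * a_row holds the n entries of row i of A. */
static int strongly_influences(const double *a_row, int n, int i, int j,
                               double theta)
{
    double max_offdiag = 0.0;
    for (int k = 0; k < n; ++k)
        if (k != i && -a_row[k] > max_offdiag)
            max_offdiag = -a_row[k];
    return j != i && -a_row[j] >= theta * max_offdiag;
}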

Figure 2.6 shows the relevant operations of the three-step procedure for choosing the coarse grid, given in Algorithm 2.

Algorithm 2 Algebraic Multigrid Coarsening

1 Select a C-point with maximal weight.
2 Define the neighboring points as F-points.
3 Update all neighbors of the newly defined F-points.

The discretization stencil used in Figure 2.6 is
\[
\begin{pmatrix}
-1 & -1 & -1 \\
-1 & 8 & -1 \\
-1 & -1 & -1
\end{pmatrix}. \tag{2.26}
\]

The next module of AMG is the interpolation. Since small eigenmodes indicate a smooth error, the interpolation uses the relation r = Ae to show the connection between a smooth error and a small residual: for a normalized eigenvector e of A with eigenvalue λ,
\[
r^T r = e^T A^2 e = \lambda^2 < 1. \tag{2.27}
\]

Writing out r = Ae at an F-point i (with r_i ≈ 0) results in
\[
a_{ii}\,e_i = -\sum_{j \in C_i} a_{ij}\,e_j \;-\; \sum_{j \in F_i^s} a_{ij}\,e_j \;-\; \sum_{j \in N_i^w} a_{ij}\,e_j, \tag{2.28}
\]
where C_i are the C-points strongly connected to i, F_i^s are the F-points strongly connected to i and N_i^w are all points weakly connected to i.

Replacing the e_j's in the last two sums by values at C_i or at the F-point i itself leads to the definition of the interpolation.


Figure 2.6: Example for the coarsening strategy (twelve panels (a)-(l) showing the successive coarsening steps).


2.3.2 The Conjugate Gradient Method

The CG algorithm is a widely used iterative method for solving linear systems. In this section only the fundamentals of the algorithm are explained; for a thorough treatment refer to the technical literature (cf. [She94]).

The main idea of CG is to find the minimum of the quadratic form
\[
f(x) = \tfrac{1}{2}\,x^T A x - b^T x + c, \tag{2.29}
\]
instead of solving the linear system Ax = b directly. For this purpose, A-orthogonal search directions are used, as shown in Algorithm 3.

Algorithm 3 Conjugate Gradient Method

1  d_(0) = r_(0) = b - A x_(0)
2  for number of iterations do
3      α_(i) = (r_(i)^T r_(i)) / (d_(i)^T A d_(i))
4      x_(i+1) = x_(i) + α_(i) d_(i)
5      r_(i+1) = r_(i) - α_(i) A d_(i)
6      β_(i+1) = (r_(i+1)^T r_(i+1)) / (r_(i)^T r_(i))
7      d_(i+1) = r_(i+1) + β_(i+1) d_(i)
8  end for

The parameter α specifies the length of each step. The aim is to walk along every search direction only once. Hence, the search directions are first chosen along the coordinate axes. This implies the orthogonality of e_(i+1) to d_(i) and results in a first approximation

\[
\alpha_{(i)} = -\frac{d_{(i)}^T e_{(i)}}{d_{(i)}^T d_{(i)}}. \tag{2.30}
\]

Since e_(i) is not known, a new approach is needed. Considering additionally the A-orthogonality of two search vectors d_(i), d_(i+1) and the residual equation r_(i) = -A e_(i) = -∇f(x_(i)), i.e. the negative gradient, results in the final formulation

\[
\alpha_{(i)} = \frac{d_{(i)}^T r_{(i)}}{d_{(i)}^T A\, d_{(i)}}. \tag{2.31}
\]

In the last two lines of the algorithm, the new search direction d_(i+1) is calculated. As a starting point, Gram-Schmidt conjugation is chosen:

\[
d_{(i)} = u_i + \sum_{k=0}^{i-1} \beta_{ik}\, d_{(k)}. \tag{2.32}
\]

Combining this equation with the residual, by setting u_(i) = r_(i), leads to
\[
d_{(i+1)} = r_{(i+1)} + \beta_{(i+1)}\, d_{(i)}. \tag{2.33}
\]

The residual is orthogonal to the previous search directions, i.e.
\[
d_{(i)}^T r_{(j)} = 0, \quad i < j, \tag{2.34}
\]
which implies
\[
r_{(i)}^T r_{(j)} = 0, \quad i \neq j. \tag{2.35}
\]


Using these two relations, β is obtained as
\[
\beta_{(i+1)} = \frac{r_{(i+1)}^T r_{(i+1)}}{r_{(i)}^T r_{(i)}}. \tag{2.36}
\]

Because of the A-orthogonal search directions and the exact step length α, the CG method converges in at most n steps. This is illustrated in Figure 2.7.
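For illustration, a minimal dense-matrix C implementation of Algorithm 3 might look as follows (a sketch written for this text; the thesis itself relies on the hypre and Z88 implementations):

#include <stdio.h>

#define N 2  /* toy problem size */

/* y = A*x for a dense N x N matrix */
static void matvec(const double A[N][N], const double *x, double *y)
{
    for (int i = 0; i < N; ++i) {
        y[i] = 0.0;
        for (int j = 0; j < N; ++j)
            y[i] += A[i][j] * x[j];
    }
}

static double dot(const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        s += x[i] * y[i];
    return s;
}

/* Conjugate gradient for an spd system Ax = b (cf. Algorithm 3). */
static void cg(const double A[N][N], const double *b, double *x,
               int max_iter, double tol)
{
    double r[N], d[N], Ad[N];

    matvec(A, x, r);                       /* r = d = b - A*x */
    for (int i = 0; i < N; ++i)
        d[i] = r[i] = b[i] - r[i];

    double rr = dot(r, r);
    for (int it = 0; it < max_iter && rr > tol * tol; ++it) {
        matvec(A, d, Ad);
        double alpha = rr / dot(d, Ad);    /* exact step length (2.31) */
        for (int i = 0; i < N; ++i) {
            x[i] += alpha * d[i];
            r[i] -= alpha * Ad[i];
        }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;         /* Gram-Schmidt coefficient (2.36) */
        for (int i = 0; i < N; ++i)
            d[i] = r[i] + beta * d[i];     /* new A-orthogonal direction (2.33) */
        rr = rr_new;
    }
}

int main(void)
{
    double A[N][N] = {{4, 1}, {1, 3}};     /* spd example matrix */
    double b[N] = {1, 2}, x[N] = {0, 0};

    cg(A, b, x, 100, 1e-10);
    printf("x = (%f, %f)\n", x[0], x[1]);  /* approx (0.0909, 0.6364) */
    return 0;
}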

Figure 2.7: Convergence of CG (cf. [She94]).

2.3.3 ParaSails Preconditioner

The hypre package contains several preconditioners. Within the scope of this thesis, four of them have been analyzed and compared:

• BoomerAMG: a parallel algebraic multigrid method,

• Euclid: an implementation of the parallel ILU algorithm,

• PILUT: a parallel incomplete LU factorization,

• ParaSails: a sparse approximate inverse preconditioner.

The best results for the problems of this thesis were achieved with ParaSails. It approximates the inverse of a matrix A by a sparse matrix M. This is done with the help of the Cholesky factor L and by minimizing
\[
\| I - ML \|_F^2 \tag{2.37}
\]
in the Frobenius norm
\[
\| A \|_F = \sqrt{ \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 }. \tag{2.38}
\]

This elucidation is based on [Cho00].


In the case of symmetric positive definite (spd) problems, a sparse lower triangular matrix G is chosen to approximate A^{-1}:
\[
G^T G \approx A^{-1}. \tag{2.39}
\]
This is done by minimizing
\[
\| I - GL \|_F^2. \tag{2.40}
\]
Since all problems of this thesis are spd, only the factorized case is considered; for other variants refer to [Cho00].

Algorithm 4 ParaSails

1 Threshold A to produce Ā.
2 Compute the pattern of Ā^L and let the pattern of G be its lower triangular part.
3 Compute the nonzero entries in G.
4 Filtration: drop small entries in G and rescale.

In the following, the single steps of the ParaSails method (Algorithm 4) are explained. In the first step, the binary matrix Ā is computed from A. It is defined as
\[
\bar{A}_{ij} = \begin{cases}
1, & \text{if } i = j \text{ or } \bigl| (D^{-1/2} A D^{-1/2})_{ij} \bigr| > \mathit{thresh}, \\
0, & \text{otherwise},
\end{cases} \tag{2.41}
\]
with
\[
D_{ii} = \begin{cases}
| A_{ii} |, & \text{if } | A_{ii} | > 0, \\
1, & \text{otherwise},
\end{cases} \tag{2.42}
\]

where thresh is the first parameter of the ParaSails implementation. A smaller value of thresh drops fewer entries from Ā and thus produces a more accurate (but more expensive) preconditioning matrix.

Using the matrix (2.41), the pattern of Ā^L is calculated by merging the sparse rows and storing them in a dense format. The exponent L is defined via the parameter nlevels within the hypre code:
\[
L = \mathit{nlevels} + 1. \tag{2.43}
\]
In the case of nlevels = 0 and thresh = 0, the sparsity pattern of G is the same as that of A.

The nonzero entries are computed with the help of the normal equations of (2.40):
\[
(G L L^T)_{ij} = (L^T)_{ij}, \qquad (i,j) \in S_L, \tag{2.44}
\]
where L is the Cholesky factor and S_L is the chosen sparsity pattern. Some reformulations lead to the preconditioned matrix
\[
G A G^T. \tag{2.45}
\]

The filtration step is used to reduce the cost of the preconditioner: all entries of G smaller than the filter parameter are dropped.

With the aid of the three transfer parameters of the ParaSails implementation, thresh, nlevels and filter, it is possible to trade off the accuracy and the cost of the preconditioner. Because the Frobenius norm (2.38) decouples into independent minimization problems for the individual rows of G, the algorithm is inherently parallel.
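As a sketch of how these three parameters are passed to hypre through its C interface (the calls below belong to hypre's ParaSails API, but the concrete values are illustrative and should be checked against the hypre 2.0.0 reference manual):

/* Create a ParaSails preconditioner and set its three parameters
 * (illustrative values, not the ones used in the thesis). */
HYPRE_Solver precond;
HYPRE_ParaSailsCreate(MPI_COMM_WORLD, &precond);
HYPRE_ParaSailsSetParams(precond, 0.1, 1);   /* thresh = 0.1, nlevels = 1 */
HYPRE_ParaSailsSetFilter(precond, 0.05);     /* drop entries below 0.05 */
HYPRE_ParaSailsSetSym(precond, 1);           /* spd problem: factorized form */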


3 The Frameworks

In this chapter, the frameworks used for solving the linear system (2.22) are introduced and explained. The chapter is based on [HYP06b], [HYP06a], [FY02] and [Rie08].

First, hypre, a software library of high-performance preconditioners and solvers, is explained. It was developed at the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. The version used here is 2.0.0, released in 2006.

The second part of the chapter elucidates the finite element program Z88, designed and edited by Prof. Dr.-Ing. F. Rieg of the University of Bayreuth. Within the scope of this thesis, version 12.0.0 has been used.

3.1 The Hypre Package

The hypre software package is mainly written in C and provides interfaces for other programming languages with the help of Babel. For more information on Babel refer to [Bab04].

The framework was developed to solve large, sparse linear systems of equations on massively parallel computers. For parallelization the package uses MPI, which is based on a distributed-memory concept. For more information on MPI consider [MPI09].

It includes several families of scalable preconditioners, for instance the already mentioned ParaSails, Euclid and PILUT algorithms. Since the most commonly used Krylov-based iterative methods are part of the framework, a broad spectrum of problems can be solved: GMRES is applied to nonsymmetric problems and CG to symmetric ones, to name only two of the available solvers. For setting up the sparse matrix data structure, hypre uses different types of interfaces: a stencil-based structured (Struct) and semi-structured (SStruct) interface, a finite element based interface (FEI) and a linear-algebraic interface (IJ).

3.1.1 Interfaces

One of the most important points when using hypre is choosing the correct interface. Figure 3.1 (cf. [HYP06b]) gives an overview of the different interfaces. The Struct interface is used for solving finite difference or finite volume problems; a fixed stencil and a structured rectangular grid are needed.

For block-structured or composite grids the SStruct interface is the right choice. It supports multiple unknowns per cell and uses a graph to allow nearly arbitrary relationships between parts of the data. The FEI interface is used for solving linear systems arising from a finite element discretization by offering a set of finite element data structures. In this thesis, however, the matrix A already includes the boundary values and element information, and splitting the finalized matrix back into finite element structures would be too costly. Therefore the last interface, the IJ interface, was used to solve for the displacements of the component parts.


Figure 3.1: Overview of the different interfaces.

Since hypre was built for large numbers of processors, the matrix of the linear system in equation (2.22) has to be distributed as
\[
A = \begin{pmatrix} A_0 \\ A_1 \\ \vdots \\ A_{p-1} \end{pmatrix}, \tag{3.1}
\]
where p is the number of processors and each A_i is itself a matrix consisting of the rows between ilower and iupper.

Algorithm 5 Building an IJ Matrix

...
HYPRE_IJMatrixCreate(comm, ilower, iupper, jlower, jupper, &ij_matrix);
HYPRE_IJMatrixSetObjectType(ij_matrix, HYPRE_PARCSR);
HYPRE_IJMatrixInitialize(ij_matrix);

HYPRE_IJMatrixSetValues(ij_matrix, nrows, ncols, rows, cols, values);

...

HYPRE_IJMatrixAssemble(ij_matrix);
HYPRE_IJMatrixGetObject(ij_matrix, (void**) &parcsr_matrix);
...

The code snippet in Algorithm 5 shows the standard setup of an IJ matrix. First, a new matrix is built with the Create() routine; the rows are split across the processors as shown in equation (3.1). With the SetObjectType() function the matrix object type is set to HYPRE_PARCSR, a sparse matrix storage format. It needs three vectors containing the nonzero values (values), the column indices (cols) and the number of nonzero values per row (ncols). A short example of


the ParCSR storage format is given in the following. The matrix
\[
\begin{pmatrix}
7 & 0 & 6 & 0 & 0 & 5 & 0 \\
4 & 0 & 0 & 2 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 3 & 0 & 0 & 0 & 0 \\
0 & 8 & 0 & 0 & 9 & 0 & 17 \\
0 & 0 & 34 & 0 & 0 & 14 & 0 \\
0 & 0 & 0 & 26 & 0 & 0 & 10
\end{pmatrix} \tag{3.2}
\]
can be written as

values = [7, 6, 5, 4, 2, 1, 3, 8, 9, 17, 34, 14, 26, 10]
cols   = [0, 2, 5, 0, 3, 5, 2, 1, 4, 6, 2, 5, 3, 6]
ncols  = [3, 2, 1, 1, 3, 2, 2].   (3.3)

Calling the Initialize() routine indicates that the matrix is ready to be filled. This is done by adding the coefficients with the SetValues() function: the parameters values, cols and ncols specify the ParCSR vectors described above, the rows vector contains the row indices, and nrows specifies the number of rows to be set. After finalizing the matrix with the Assemble() function and calling the GetObject() routine, the ParCSR matrix can be passed to a hypre solver.
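A small helper showing how the three vectors in (3.3) can be derived from a dense matrix (a sketch written for this text, not part of hypre):

/* Convert a dense n x n matrix into the values/cols/ncols layout of (3.3).
 * Returns the total number of nonzeros written. */
static int dense_to_parcsr(int n, const double *A,
                           double *values, int *cols, int *ncols)
{
    int nnz = 0;
    for (int i = 0; i < n; ++i) {
        ncols[i] = 0;
        for (int j = 0; j < n; ++j) {
            if (A[i * n + j] != 0.0) {
                values[nnz] = A[i * n + j];
                cols[nnz]   = j;
                ++nnz;
                ++ncols[i];
            }
        }
    }
    return nnz;
}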

The right-hand side and the solution vector are set up with the IJ vector interface. The approach is the same as in the matrix case and is shown in the code snippet of Algorithm 6.

Algorithm 6 Building an IJ Vector

...
HYPRE_IJVectorCreate(comm, jlower, jupper, &ij_vector);
HYPRE_IJVectorSetObjectType(ij_vector, HYPRE_PARCSR);
HYPRE_IJVectorInitialize(ij_vector);

HYPRE_IJVectorSetValues(ij_vector, nvalues, indices, values);

...

HYPRE_IJVectorAssemble(ij_vector);
HYPRE_IJVectorGetObject(ij_vector, (void**) &parcsr_vector);
...

3.1.2 Solvers and Preconditioners

As mentioned before, hypre includes several solvers and preconditioners. Figure 3.2 shows the methodsused in the thesis and their dependencies on the interfaces.


Figure 3.2: Overview of the various solvers and preconditioners.

The setup and run routines for the various hypre solvers are nearly identical; the only difference is the setting of the parameters. This is illustrated in Algorithm 7.

Algorithm 7 Setup and run routine for the solvers

...
// Create solver
HYPRE_SOLVERCreate(MPI_COMM_WORLD, &solver);

// Set certain parameters
HYPRE_SOLVERSetTol(solver, 1.e-7);
...
// Set up solver
HYPRE_SOLVERSetup(solver, A, b, x);
// Solve the system
HYPRE_SOLVERSolve(solver, A, b, x);
// Destroy the solver
HYPRE_SOLVERDestroy(solver);
...

A preconditioner has to be set up before the solver routines are called, as shown in Algorithm 8.


Algorithm 8 Setup and run routine for the solvers using a preconditioner

...
// Set up the preconditioner
HYPRE_PRECONDCreate(MPI_COMM_WORLD, &precond_solver);

// Optional fine-tuning of the preconditioner
...

// Set up the used solver
HYPRE_SOLVERCreate(MPI_COMM_WORLD, &solver);

// Optional fine-tuning of the solver
...

// Initialize the preconditioner
HYPRE_SOLVERSetPrecond(solver, HYPRE_PRECONDSolve, HYPRE_PRECONDSetup, precond_solver);

HYPRE_SOLVERSetup(solver, A, b, x);

// Solve the system
HYPRE_SOLVERSolve(solver, A, b, x);

// Destroy the solver
HYPRE_SOLVERDestroy(solver);
HYPRE_PRECONDDestroy(precond_solver);
...

3.2 Z88

The second framework used in the thesis is the finite element program Z88. It was developed and designed especially for PCs and is subdivided into modules. Z88 runs on Windows and Unix platforms and can exchange data with CAD systems, e.g. Pro/ENGINEER. The framework covers 20 different element types in total and can calculate plane stress, plate bending, axisymmetric structures and spatial structures up to 20-node serendipity hexahedra.

Z88 consists of different modules which can be run separately; after finishing a job, a module frees its allocated memory. The communication between the modules is handled via input and output data sets. A graphical user interface (GUI) for the framework's modules is provided by Z88COM, from which all operations can be started and controlled. Figure 3.3 (cf. [Rie08]) shows the GUI for the Windows and Unix commander. The basis of the framework are the solvers; at the moment Z88 supports three different solvers for calculating FE problems.

Z88F, a direct Cholesky solver without fill-in, is used for solving small to average-size structures with up to 30000 degrees of freedom.

The PARDISO solver was developed by O. Schenk at the University of Basel and is currently the only parallel solver in the framework, using up to 9 CPUs. A direct decomposition with fill-in makes it possible to calculate medium-size structures with up to 150000 degrees of freedom. The huge memory requirements and the CPU limitation of Z88PAR are two disadvantages of the method.

Z88I2 is a sparse matrix solver based on the CG algorithm. Because the iterative


(a) Windows (b) Unix

Figure 3.3: Z88COM.

method is preconditioned with either an SOR or a Cholesky decomposition, fast calculations of FE structures with up to 5 million degrees of freedom are no problem. The execution of the method is split into two phases: in the first step, Z88I1 builds up the structure of the global stiffness matrix; in the second step, Z88I2 calculates the local stiffness matrices, assembles them into the global matrix and solves the system.

Another important feature of Z88 is the possibility of creating Z88 input files from CAD data and vice versa. This is done with the module Z88X and the 3D converter Z88G. Joint forces and strains are calculated with the help of Z88D and Z88E. Furthermore, the framework includes a mesh generator, a file checker and its own plotting program.

Since the PARDISO solver was not yet included in version 12, only the Z88F and Z88I1/I2 solvers were used in this thesis.


4 Computer Architecture

The computer architecture used in this thesis is the Woodcrest Cluster (termed woody) [oE09] at theRegional Computing Center of Erlangen (RRZE).

Woody is a high-performance cluster with 217 nodes. Each node (Figure 4.1) contains two "Xeon 5160" Woodcrest chips (four cores in total) and 8 GB of RAM. The cores run at a clock speed of 3.00 GHz with a 32 kB Level 1 cache and a 4 MB unified Level 2 cache.

Figure 4.1: Compute node of the Woodcrest Cluster.

The InfiniBand network of the cluster has a bandwidth of 10 GBit/s per link and direction. Furthermore, woody has a peak performance of 12 GFlop/s per processor and an overall peak performance of 10.4 TFlop/s; the performance measured with LINPACK is 6.62 TFlop/s.

In November 2006 the system entered the Top 500 list at rank 124, and it was still at rank 329 in November 2007.


5 Results

This chapter gives an overview of the results achieved in the displacement calculations and compares the different solvers and preconditioners with respect to time and number of iterations. First, the component parts and their configurations are introduced. Afterwards, the results of the solvers are presented, beginning with the Z88 methods, followed by BoomerAMG and CG. The chapter concludes with a summary of all results.

5.1 Test Problems

This section gives a short overview of the component parts for which the displacements were calculated. For each of the seven problems the boundary conditions, the numbers of nodes, elements and degrees of freedom, the type of loading and the element type are listed.

(a) light view (b) grid view

Figure 5.1: L section.

L section (Figure (5.1))

• 3D

• Dirichlet boundary conditions

• 1758 nodes

• 6535 elements

• 5274 degrees of freedom

• surface pressure loads

• type of element

– tetrahedrons

– linear basis functions

– size of stiffness matrix 12x12


(a) light view (b) grid view

Figure 5.2: Piston.

Piston (Figure (5.2))

• 3D

• Dirichlet boundary conditions

• 32522 nodes

• 129569 elements

• 97566 degrees of freedom

• surface pressure loads

• type of element

– tetrahedrons

– linear basis functions

– size of stiffness matrix 12x12

(a) light view (b) grid view

Figure 5.3: Fan.


Fan (Figure (5.3))

• 3D

• Dirichlet boundary conditions

• 34495 nodes

• 130541 elements

• 103485 degrees of freedom

• surface pressure loads

• type of element

– tetrahedrons

– linear basis functions

– size of stiffness matrix 12x12

(a) light view (b) grid view

Figure 5.4: Hub carrier.

Hub carrier (Figure (5.4))

• 3D

• Dirichlet boundary conditions

• 13392 nodes

• 58794 elements

• 40176 degrees of freedom

• surface pressure loads

• type of element

– tetrahedrons

– linear basis functions

– size of stiffness matrix 12x12


(a) light view (b) grid view

Figure 5.5: Arch.

Arch (Figure (5.5))

• 3D

• Dirichlet boundary conditions

• 31431 nodes

• 18983 elements

• 94293 degrees of freedom

• surface pressure loads

• type of element

– tetrahedrons

– quadratic isoparametric serendipity elements

– size of stiffness matrix 30x30

(a) light view (b) grid view

Figure 5.6: Connecting rod.


Connecting rod (Figure (5.6))

• 3D

• Dirichlet boundary conditions

• 35751 nodes

• 19622 elements

• 107253 degrees of freedom

• external forces

• type of element

– tetrahedrons

– quadratic isoparametric serendipity elements

– size of stiffness matrix 30x30

(a) light view (b) grid view

Figure 5.7: I beam.

I beam (Figure (5.7))

• 3D

• Dirichlet boundary conditions

• 753474 nodes

• 4151839 elements

• 2260422 degrees of freedom

• external forces

• type of element

– tetrahedrons

– linear basis functions

– size of stiffness matrix 12x12


5.2 Measurement Methods

The running time is measured with gettimeofday(), which offers microsecond resolution. The routine is wrapped in the functions timer_start() and timer_lap(); the latter returns the time in seconds since the last call of gettimeofday(). Since the solvers are the focus of interest, only the solution methods themselves are measured. The code snippets in Algorithms 9, 10 and 11 show the time measurement for Z88F (in file z88cc.c), for Z88I2 (in file z88ci.c) and for the hypre solvers, respectively.

Algorithm 9 Time measurement of Z88F

...
timer_start();
choy88();
printf("Time for solving: %e", timer_lap());
...

Algorithm 10 Time measurement of Z88I2

...
timer_start();
sorcg88();
printf("Time for solving: %e", timer_lap());
...

Algorithm 11 Time measurement of hypre solvers

...
timer_start();
HYPRE_SOLVERSolve(solver, A, b, x);
printf("Time for solving: %e", timer_lap());
...
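The timer helpers themselves are not listed in the thesis; a minimal sketch of how timer_start() and timer_lap() can be implemented on top of gettimeofday() (an illustration written for this text) is:

#include <sys/time.h>

static struct timeval t_last;   /* time stamp of the last timer call */

void timer_start(void)
{
    gettimeofday(&t_last, NULL);
}

/* Returns the seconds elapsed since the last timer_start() or timer_lap(). */
double timer_lap(void)
{
    struct timeval now;
    double secs;

    gettimeofday(&now, NULL);
    secs = (now.tv_sec - t_last.tv_sec) + (now.tv_usec - t_last.tv_usec) * 1e-6;
    t_last = now;
    return secs;
}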

All solvers tested in the thesis use the same stopping criterion, i.e. the residual must be smaller than $10^{-7}$.

5.3 Z88

Table 5.1 shows the wall-clock time in seconds and the number of iterations for the Z88 solvers. Because of the large number of degrees of freedom of the problems, the Cholesky solver Z88F could only handle the L section. Z88I2, the iterative CG method, was preconditioned with SOR using a relaxation parameter ω = 1 (Gauss-Seidel). As noted in Section 5.1, the fan is one of the most complicated component parts tested in the thesis; Z88I2 needs 152 seconds and 2322 iterations to solve it.


Test problem    | Z88F time | Z88F iter | Z88I2 time | Z88I2 iter
L section       |     37.42 |      5274 |       0.41 |        224
Hub carrier     |         / |         / |       8.82 |        314
Piston          |         / |         / |      18.93 |        235
Arch            |         / |         / |      47.72 |        383
Fan             |         / |         / |     152.36 |       2322
Connecting rod  |         / |         / |     134.60 |        987
I beam          |         / |         / |    1480.63 |        589

Table 5.1: Times for calculating the displacements of the component parts with Z88.

Figure 5.8 illustrates the relative residual over the iterations for the L section, solved by Z88I2.

[Plot: relative residual (logarithmic scale, 10^-9 to 1) over the number of iterations (0 to 200).]

Figure 5.8: Residual for solving the L section with Z88I2.

5.4 BoomerAMG

This section presents the results achieved with BoomerAMG used as a solver. The following optional parameters of the method were chosen: the strength threshold Θ, explained in equation (2.25), was set to 0.73, and the maximum number of cycles to 20. Furthermore, Ruge3 coarsening and F-F interpolation were selected; for more information on these algorithms refer to [HYP06b]. Table 5.2 shows the solving time and the number of iterations for the component parts. The calculations for the three largest problems, the fan, the connecting rod and the I beam, were stopped after more than 12.5 hours and more than 70000 iterations.


Test problems        BoomerAMG
                     time        iter
L section            35.86       6423
Hub carrier          523.39      4311
Piston               842.51      2799
Arch                 7864.2      15038
Fan                  > 12.5 h    > 136532
Connecting rod       > 12.5 h    > 76000
I beam               > 12.5 h    > 2140

Table 5.2: Times for calculating the displacements of the component parts with BoomerAMG.

The bad results for AMG are not unexpected. According to [TO00] and [MGS03], several problems arise when solving LE problems with MG, e.g. concerning the rigid body modes and the near null space.

In Figure 5.9, the residual over the iterations for solving the L section is illustrated; the displacements have been calculated with BoomerAMG used as a solver.


Figure 5.9: Residual for solving the L section with BoomerAMG.

5.5 Conjugate Gradient

The last tested method of the thesis is CG. It was used as a stand-alone solver and in combination with different preconditioners, namely the above mentioned BoomerAMG, Euclid, PILUT and ParaSails. CG with PILUT as a preconditioner diverged for every test problem.

Table 5.3 presents the times and numbers of iterations for the plain CG solver. Noticeable is the large number of iterations, which causes calculation times higher than those of Z88I2.


Test problems        CG
                     time      iter
L section            0.32      922
Hub carrier          46.30     6584
Piston               70.89     3773
Arch                 76.20     2486
Fan                  362.46    20655
Connecting rod       563.95    15477
I beam               9033.94   10869

Table 5.3: Times for calculating the displacements of the component parts with CG.

Figure 5.10 illustrates the residuals over the iterations when the CG method is used to calculate the displacements of the L section.


Figure 5.10: Residual for solving the L section with CG.

The next part of this section presents the results for the preconditioners BoomerAMG, Euclid and ParaSails in combination with CG. In the case of BoomerAMG, the parameters were set in the same way as for BoomerAMG used as a plain solver: the strength threshold was set to 0.73, Ruge3 was chosen as coarsening strategy, and the F-F interpolation was used.
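To show how a preconditioner is attached to hypre's parallel CG, here is a minimal sketch under the same assumptions as above; the ParaSails threshold and level parameters are placeholders, not the values used in the thesis.

#include <mpi.h>
#include "HYPRE.h"
#include "HYPRE_krylov.h"
#include "HYPRE_parcsr_ls.h"

void solve_with_cg_parasails(HYPRE_ParCSRMatrix A, HYPRE_ParVector b, HYPRE_ParVector x)
{
    HYPRE_Solver pcg, precond;

    HYPRE_ParCSRPCGCreate(MPI_COMM_WORLD, &pcg);
    HYPRE_PCGSetTol(pcg, 1e-7);   /* stopping criterion of Section 5.2 */
    HYPRE_PCGSetTwoNorm(pcg, 1);  /* use the 2-norm in the convergence test */

    HYPRE_ParaSailsCreate(MPI_COMM_WORLD, &precond);
    HYPRE_ParaSailsSetParams(precond, 0.1, 1); /* placeholder threshold and levels */
    HYPRE_ParaSailsSetSym(precond, 1);         /* the FEM system is SPD */

    /* Register the preconditioner's solve and setup callbacks with PCG. */
    HYPRE_PCGSetPrecond(pcg,
        (HYPRE_PtrToSolverFcn) HYPRE_ParaSailsSolve,
        (HYPRE_PtrToSolverFcn) HYPRE_ParaSailsSetup,
        precond);

    HYPRE_ParCSRPCGSetup(pcg, A, b, x);
    HYPRE_ParCSRPCGSolve(pcg, A, b, x);

    HYPRE_ParCSRPCGDestroy(pcg);
    HYPRE_ParaSailsDestroy(precond);
}

Since HYPRE_PCGSetPrecond only stores the preconditioner's solve and setup callbacks, Euclid or BoomerAMG can be substituted by exchanging the corresponding Create/Set calls and function pointers.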

Table 5.4 shows the values for all component parts. It was not possible to calculate the displacements of the fan and the piston with the Euclid preconditioner; the computation was numerically unstable.


Test problems        CG+BoomerAMG     CG+Euclid        CG+ParaSails
                     time      iter   time      iter   time      iter
L section            1.87      23     0.22      85     0.16      228
Hub carrier          39.91     24     6.97      204    5.63      567
Piston               91.91     23     /         /      12.05     455
Arch                 234.276   35     31.39     122    23.97     591
Fan                  1234.87   319    /         /      106.78    3617
Connecting rod       854.50    120    88.32     289    46.36     1002
I beam               > 4 h     > 56   4643.76   1445   5183.85   3971

Table 5.4: Times for calculating the displacements of the component parts with CG and a preconditioner.

Figure 5.11 illustrates once more the residuals over the iterations for the L section; this time, the CG method is combined with the different preconditioners.


Figure 5.11: Residual for solving the L section with CG and using Euclid, BoomerAMG or ParaSails as a preconditioner.

Furthermore, Figure 5.12 illustrates the exact times and the time ratios t(hypre)/t(Z88I2) for solving the displacements with the ParaSails-preconditioned CG method and with Z88I2. The time differences for the smaller problems are only marginal; e.g., the calculation of the displacements for the hub carrier is only 3 seconds faster than with Z88I2. Using the parallel methods pays off for the larger problems: e.g., the fan is solved 46 seconds faster and the connecting rod 88 seconds faster with the combination of CG and ParaSails. The only exception among the component parts is the I beam; this problem is solved 61 minutes faster with Z88I2 than with the hypre methods.



Figure 5.12: Times and ratios for solving the component parts with CG+ParaSails and Z88I2.

Now, the section is concluded with the results for using multiple CPUs. Table 5.5 shows the calculation times and the speedup for the different component parts using up to 8 CPUs.
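Here the speedup denotes the usual ratio of single-CPU time to n-CPU time,

S(n) = t(1) / t(n), e.g. S(8) = 5183.85 / 1833.24 ≈ 2.83 for the I beam in Table 5.5 (g).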


(a) L section

#CPU    time    speedup
1       0.16    1
2       0.14    1.15
3       0.08    1.88
4       0.07    2.28
5       0.07    2.38
6       0.06    2.49
7       0.07    2.36
8       0.08    1.88

(b) Hub carrier

#CPU    time    speedup
1       5.63    1
2       4.08    1.37
3       3.90    1.44
4       3.70    1.52
5       3.28    1.71
6       2.81    2.00
7       2.79    2.02
8       2.65    2.12

(c) Arch

#CPU    time     speedup
1       23.97    1
2       15.70    1.53
3       16.12    1.49
4       15.67    1.53
5       12.68    1.89
6       11.18    2.15
7       10.99    2.18
8       10.64    2.25

(d) Piston

#CPU    time     speedup
1       12.05    1
2       9.03     1.33
3       8.19     1.47
4       8.15     1.48
5       6.15     1.96
6       6.48     1.86
7       5.65     2.13
8       5.79     2.08

(e) Fan

#CPU    time      speedup
1       106.78    1
2       76.72     1.40
3       79.49     1.35
4       67.04     1.60
5       49.36     2.17
6       43.94     2.44
7       46.30     2.32
8       43.86     2.44

(f) Connecting rod

#CPU    time     speedup
1       46.36    1
2       36.02    1.37
3       35.33    1.40
4       32.55    1.52
5       28.84    1.71
6       24.94    1.98
7       23.47    2.10
8       23.17    2.13

(g) I beam

#CPU    time       speedup
1       5183.85    1
2       3838.49    1.35
3       3667.69    1.41
4       3225.62    1.60
5       2480.51    2.09
6       1955.41    2.65
7       1918.76    2.70
8       1833.24    2.83

Table 5.5: Times and speedup for multiple CPUs.

Figure 5.13 and Figure 5.14 illustrate the time ratios t(hypre)/t(Z88) for solving the fan and the I beam with multiple CPUs, respectively.



Figure 5.13: Ratios for solving the fan with multiple CPUs.


Figure 5.14: Ratios for solving the I beam with multiple CPUs.


5.6 Conclusions

For all problems described in Section 5.1, the displacements caused by surface loads, pressure loads or external forces have been calculated. For this purpose, a variety of solvers and preconditioners has been tested and compared.

The direct Cholesky solver Z88F is only useful for small to medium sized problems. Of the component parts, only the L section with 5274 degrees of freedom could be calculated; all other problems led to segmentation faults and stopped the calculation. The iterative solver Z88I2 is the most used and most robust method within the Z88 framework. All mentioned problems were solved fast and accurately. For instance, Table 5.1 shows that the calculation of the displacements for the L section was about 37 seconds faster than with Z88F.

The worst results were achieved with BoomerAMG. Even the smallest problems take far longer, e.g. 35.86 seconds for the L section compared to 0.16 seconds for the fastest combination, CG with ParaSails. The calculation of the three largest problems was stopped after more than 12.5 hours (the run for the fan was terminated after 136532 iteration steps at a relative residual of 2 · 10⁻²). According to Table 5.2, the solution of the L section takes over 6000 iterations, with an average convergence factor of 0.9. Using BoomerAMG as a preconditioner does not improve the calculation speed either: solving the equations for the fan in combination with CG is about 800 seconds slower than using the plain CG method, as shown in Tables 5.3 and 5.4. As mentioned before, these bad results are not unexpected; [TO00] and [MGS03] provide more details on the problems regarding LE and MG.

The best results were achieved with the combination of the CG method and ParaSails used as a preconditioner, both from the highly parallel hypre framework. Even with only 1 CPU, the hypre combination is much faster than the Z88 solvers, except for the I beam with its over 2 million degrees of freedom, which was solved more slowly than with Z88I2. For the smaller and simpler problems, shown in Tables 5.1 and 5.4, the time differences are only marginal: for example, the L section is calculated 0.25 seconds faster and the piston 6.88 seconds faster than with Z88I2. For the more complex component parts the differences increase, as the same tables show: Z88I2 is about 46 seconds slower for calculating the displacements of the fan and 88 seconds slower for the connecting rod than the CG and ParaSails combination. These results are also illustrated in Figure 5.12, where the time differences of the two frameworks are compared directly.

The performance of the hypre solvers can be increased considerably by using multiple CPUs. Table 5.5 shows the times and speedup values for solving the different problems with up to 8 CPUs. Already with 6 CPUs, a doubling of the speed is achieved for all component parts. For instance, solving the displacements of the fan with CG and ParaSails on 8 CPUs is 143% faster than the calculation with 1 CPU and even 247% faster than Z88I2. The fan was also calculated with up to 120 CPUs; Figure 5.13 shows that the best results were achieved with 30 to 45 processors. With more than 50 CPUs, the communication between the processors dominates and the solving times slowly increase again. With the help of multiple CPUs, the solving time of the largest problem, the I beam, can also be reduced considerably: with 10 CPUs the calculation becomes as fast as Z88I2, and with 243 CPUs it is even 22.3 minutes faster. This correlation is shown in Figure 5.14.


6 Conclusion

The main idea of the thesis was to solve displacement equations and to test and compare various solvers and preconditioners. The displacements of component parts are one of the application fields of LE; they were discretized with the help of the FEM into the linear equation system (2.22). This system was then solved with the Z88 methods and with the parallel solvers and preconditioners from the hypre package. The best results were achieved with a parallel implementation of the CG algorithm using ParaSails as a preconditioner; the performance of these hypre methods was increased further by using multiple CPUs. Figure 6.1 shows the deformation of the fan.

(a) Undeformed (b) Deformed

Figure 6.1: Deformation of the fan.


Bibliography

[Bab04] Babel. https://computation.llnl.gov/casc/components/babel.html [Accessed 23.11.2009], 2004.

[Cho00] E. Chow. Parallel implementation and performance characteristics of least squares sparse approximate inverse preconditioners. Int. J. High Perf. Comput. Apps, 2000.

[Fal06] R.D. Falgout. An introduction to algebraic multigrid. Computing in Science and Engineering,2006.

[FY02] R.D. Falgout and U.M. Yang. hypre: a library of high performance preconditioners. Computational Science, 2002.

[HYP06a] Hypre - Reference Manual. https://computation.llnl.gov/casc/hypre/software.html [Accessed 23.11.2009], 2006.

[HYP06b] Hypre - User's Manual. https://computation.llnl.gov/casc/hypre/software.html [Accessed 23.11.2009], 2006.

[HYP09] Hypre. https://computation.llnl.gov/casc/linear_solvers/sls_hypre.html [Accessed 23.11.2009], 2009.

[JL01] M. Jung and U. Langer. Methode der finiten Elemente fur Ingenieure. Teubner Verlag, 2001.

[MGS03] M. Griebel, D. Oeltz, and M. A. Schweitzer. An algebraic multigrid method for linear elasticity. SIAM, 2003.

[MPI09] MPI. http://www.mcs.anl.gov/research/projects/mpi/ [Accessed 23.11.2009], 2009.

[oE09] Regional Computing Center of Erlangen. Woodcrest Cluster. http://www.rrze.uni-erlangen.de/dienste/arbeiten-rechnen/hpc/systeme/woodcrest-cluster.shtml [Accessed 23.11.2009], 2009.

[RH09] F. Rieg and R. Hackenschmidt. Finite Elemente Analyse fur Ingenieure. HANSER, 2009.

[Rie08] F. Rieg. Z88 - User's Manual. http://www.z88.uni-bayreuth.de/english.html [Accessed 23.11.2009], 2008.

[She94] J.R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical report, School of Computer Science, Carnegie Mellon University, 1994.

[TO00] U. Trottenberg and C. W. Oosterlee. Multigrid: Basics, Parallelism and Adaptivity. Academic Press, 2000.

[WBM99] W. Briggs, V.E. Henson, and S. McCormick. A Multigrid Tutorial. SIAM, 1999.

[Z8809] Z88. http://www.z88.uni-bayreuth.de/ [Accessed 23.11.2009], 2009.
