

SIAM J. Sci. Comput., Vol. 16, No. 3, pp. 531-541, May 1995

© 1994 Society for Industrial and Applied Mathematics

AN ALGORITHM WITH POLYLOG PARALLEL COMPLEXITY FOR SOLVING PARABOLIC PARTIAL DIFFERENTIAL EQUATIONS*

G. HORTON, S. VANDEWALLE, AND P. WORLEY

Abstract. The standard numerical algorithms for solving parabolic partial differential equations are inherently sequential in the time direction. This paper describes an algorithm for the time-accurate solution of certain classes of parabolic partial differential equations that can be parallelized in both time and space. It has a serial complexity that is proportional to the serial complexities of the best-known algorithms. The algorithm is a variant of the multigrid waveform relaxation method where the scalar ordinary differential equations that make up the kernel of computation are solved using a cyclic reduction-type algorithm. Experimental results obtained on a massively parallel multiprocessor are presented.

Key words. parabolic partial differential equations, massively parallel computation, waveform relaxation, multigrid, cyclic reduction

AMS subject classifications. primary 65M, 65W; secondary 65L05

1. Introduction. For many numerical problems in scientific computation, the execution time grows without bound as a function of the problem size, independent of the number of processors and of the algorithm used [36], [38], [40]. In particular, for most linear partial differential equations (PDEs) arising in mathematical physics, the parallel complexity grows as O(log N), where N is a particular measure of the problem size. The proof is based on deriving upper and lower bounds on the execution time of optimal parallel algorithms for multiprocessors with an unlimited number of processors and no interprocessor communication costs, where both upper and lower bounds are proportional to log N. These optimal parallel algorithms can have very large serial complexities, and the tightness of the bounds on the parallel execution time for practical algorithms is not established by this analysis.

In the analysis of standard numerical algorithms for linear PDEs, there is a strong dichotomy in the nature of the growth in the parallel execution time between algorithms for the solution of time-dependent and time-independent PDEs [37], a dichotomy that is not present in the optimal algorithm analysis. For example, when approximating elliptic PDEs using finite difference or finite element discretizations, the serial complexity is at least Θ(Ns), where Ns is the size of the underlying grid and Θ(x) denotes a positive quantity whose leading order term is proportional to x. This linear serial complexity can often be achieved using a full multigrid V-cycle algorithm with weighted Jacobi or multicolor Gauss-Seidel smoothing and local restriction/prolongation operators, which has a parallel complexity of Θ(log^2 Ns) on a multiprocessor with Θ(Ns) processors [3], [9], [25]. So, for these problems, there exists an algorithm whose serial complexity is proportional to that of the best serial algorithm and whose parallel complexity is a polylog function of the serial complexity.

*Received by the editors August 21, 1991; accepted for publication (in revised form) February 22, 1994.

†Lehrstuhl für Rechnerstrukturen (IMMD 3), Universität Erlangen-Nürnberg, Martensstrasse 3, D-8520 Erlangen, Federal Republic of Germany (graham@immd3.informatik.uni-erlangen.de).

‡Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, B-3001 Leuven (Heverlee), Belgium (stefan@cs.kuleuven.ac.be).

§Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, Tennessee 37831-6367 (worley@msr.epm.ornl.gov). This research was supported by the Applied Mathematical Sciences Research Program, Office of Energy Research, U.S. Department of Energy under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems Inc.

The following text presents research results of the Belgian Incentive Program "Information Technology" - Computer Science of the Future, initiated by the Belgian State, Prime Minister's Service, Science Policy Office. The scientific responsibility is assumed by its authors.


Time-stepping methods are commonly used to calculate the time-accurate solution of time-dependent PDEs. For a time-accurate solution, the solution is required at a sequence of times {ti, i = 0, ..., Nt}, where ti-1 < ti and ti - ti-1 is small enough to allow accurate interpolation of the solution at all times in between. Standard time-stepping algorithms based on finite difference or finite element discretizations of parabolic PDEs have serial complexities that are linear in the number of space-time locations where the solution function is approximated. Thus, the serial complexity is Θ(Ns · Nt), where Ns is the size of the underlying grid at a fixed time and Nt is the number of time levels. The computations at each time level are usually easily parallelized, but the processing of the time direction in a time-stepping algorithm is inherently sequential. Thus, the parallel complexity is always at least Θ(Nt), i.e., not a polylog function of the serial complexity. Variants of the standard time-stepping algorithms have been proposed that begin the calculation for later time levels before the current time level is finished [10], [28], [35]. These algorithms, however, do not alter the sequential nature of the time direction. Alternatively, the multigrid idea has been extended to parabolic problems discretized on space-time grids in [13] and a time-parallel variant was studied in [1], [4], [5], [16]. This parabolic multigrid algorithm allows computations to proceed concurrently on different time levels, yet lacks robustness w.r.t. the mesh-size parameters and is limited to particular time-discretization schemes and assumptions on the smoothness of the time-dependent data (forcing function) [39].

This paper addresses the question of whether good serial algorithms for parabolic PDEs are intrinsically less parallel than good serial algorithms for elliptic PDEs. We pose the question in the following form. For a given parabolic PDE, is there a numerical algorithm for the time-accurate approximation of the solution with the following properties?

(i) The serial complexity of the algorithm is proportional to the serial complexity of the best-known serial algorithm for this problem.

(ii) The algorithm can be parallelized in the time direction as well as the spatial directions, so that the achievable parallel complexity given an unlimited number of processors is a polylog function of the serial complexity.

In §2, we describe a class of algorithms that have properties (i) and (ii) for a large class of linear parabolic PDEs. Not only does this class of algorithms answer the above theoretical question, it also has practical applications on massively parallel multiprocessors. This is illustrated in §3, where timing results obtained on a massively parallel SIMD multiprocessor are given. The algorithm can be adapted and made into an algorithm for solving nonlinear parabolic problems and systems of equations. This is briefly discussed in §4, where we also mention how it can be modified to run efficiently on large MIMD multiprocessors. We finish in §5 with some concluding remarks.

Note that the question posed above can also be asked of hyperbolic PDEs. The class of algorithms described here is not effective for hyperbolic problems, but a similar approach can be used to develop algorithms with properties (i) and (ii) for a restricted, but important, class of constant coefficient hyperbolic PDEs. See [41] for details.

2. Multigrid waveform relaxation.

2.1. Waveform relaxation. Waveform relaxation is a technique for solving systems of ordinary differential equations (ODEs) of initial-value type [23], [27]. It is based on applying standard point and block iterative methods for the solution of linear systems to the solution of a system of ODEs. For example, let A be an n × n matrix, and consider the problem dU/dt + AU = F, where U and F are vector functions of time. Then the kth step of a Jacobi-based iterative method for the solution of this system is

(1)    dU^(k)/dt + D U^(k) = F + (D - A) U^(k-1),


where D is the diagonal of A. Thus, each step of the method involves the solution of n independent scalar ODEs. The decoupling of the system allows different discretizations and timesteps to be used for each of the ODEs, which can lead to significant savings for some applications.
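As a concrete illustration of iteration (1), the following Python sketch runs Jacobi waveform relaxation on a small model problem. The function name, the 1D heat-equation semidiscretization, and all parameter values are illustrative choices of ours, not taken from the paper; the scalar ODEs are solved here with backward Euler on a fixed time grid.

```python
import numpy as np

def jacobi_waveform_relaxation(A, F, u0, dt, nt, sweeps):
    """Approximate dU/dt + A U = F, U(0) = u0, by the Jacobi waveform
    iteration (1): each sweep solves n independent scalar ODEs
        du_i/dt + d_i u_i = (F + (D - A) U_prev)_i,
    discretized here with backward Euler on a fixed time grid."""
    d = np.diag(A)                            # D = diag(A)
    offdiag = np.diag(d) - A                  # D - A
    U = np.tile(u0[:, None], (1, nt + 1))     # initial guess, constant in time
    for _ in range(sweeps):
        rhs = F + offdiag @ U                 # couples only to the previous iterate
        Unew = np.empty_like(U)
        Unew[:, 0] = u0
        for j in range(1, nt + 1):            # n scalar backward Euler recurrences
            Unew[:, j] = (Unew[:, j - 1] + dt * rhs[:, j]) / (1.0 + dt * d)
        U = Unew
    return U

# Illustrative model problem: semidiscrete 1D heat equation on (0, 1),
# zero Dirichlet boundary conditions, standard three-point stencil.
n, nt, dt = 7, 32, 0.005
h = 1.0 / (n + 1)
A = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
u0 = np.sin(np.pi * np.linspace(h, 1 - h, n))
U = jacobi_waveform_relaxation(A, np.zeros((n, nt + 1)), u0, dt, nt, sweeps=100)
```

At convergence the iterate coincides with backward Euler applied to the coupled system, since the fixed point of a sweep satisfies (I + dt A) Uj = Uj-1 + dt Fj.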

To solve parabolic PDEs, the spatial derivatives are discretized to generate a semidiscrete problem and the resulting system of ODEs is solved using waveform relaxation. Miekkala and Nevanlinna have analyzed the convergence of waveform relaxation for linear operators [26]. They showed that, for linear PDEs of the form ut + Lu = f, where L is an elliptic operator, and for standard spatial discretizations, the convergence rates for Jacobi and Gauss-Seidel iterations for the semidiscrete problem are often similar to those for the analogous discrete elliptic problem.

2.2. Multigrid acceleration. Let Lh represent the discrete operator generated by discretizing the elliptic operator L on a grid of meshsize h. The correspondence between the convergence rate of waveform relaxation applied to dU/dt + Lh U = F and the convergence rate of the analogous matrix iterative method applied to Lh U = F has two immediate implications. First, the convergence rate is too slow for waveform relaxation to be competitive with standard time-stepping algorithms. Second, multigrid techniques may be effective at accelerating convergence of the iteration.

Multigrid acceleration has been analyzed by Lubich and Ostermann [24]. Among their results, they showed that multigrid performance, i.e., a convergence factor bounded by a small constant independent of h, can be achieved for the semidiscrete problem if Lh is a symmetric positive definite matrix with constant diagonal entries, and either weighted Jacobi relaxation is used, or Lh has the form

(2)    Lh = | D1  B  |
            | C   D2 |

where D1 and D2 are diagonal, and Gauss-Seidel relaxation is used. Note that this latter matrix structure commonly occurs when using a red-black ordering with standard finite difference stencils. They also show that multigrid performance can be achieved for the fully discrete problem if, in addition to the above conditions, all ODEs are discretized by the same method, the time direction is not coarsened, and the ODE solver is an A(α)-stable linear multistep method with suitably large α or an algebraically stable Runge-Kutta method. Note that all of these conditions are sufficient, but not necessary.

This multigrid waveform relaxation algorithm has been tested extensively [31]-[33], and has been shown to work well for a variety of parabolic problems, both linear and nonlinear, on both serial and parallel computers. In particular, it has been shown that the use of multigrid V-cycles leads in many cases to a rapidly converging iteration with typical convergence factors in the range of 0.06 to 0.2, independent of the size of the spatial mesh and independent of the length of the time-integration interval or the number of time levels. In such cases a nested iteration or full multigrid strategy that employs a small number of V-cycles on each grid level results in an algorithm with linear serial complexity and an algebraic error (norm of the difference between approximation and converged discrete solution) that is smaller than the discretization error (norm of the difference between the converged discrete solution and the solution of the PDE). In addition, it is illustrated in the above references that the serial computation time of the waveform algorithm is similar to (and often smaller than) the serial computation time of a corresponding time-stepping code.

2.3. Multigrid waveform relaxation with cyclic reduction. The waveform relaxation algorithm is normally implemented in a fashion that is still intrinsically sequential in the


time direction. Some time parallelism was introduced in the method described in [34], but a (small) sequential component remained. Here, we introduce parallelism in the time direction by exploiting an idea from the parallel-ODE literature [2], [6], [11], [12].

The computation in the time direction only involves solving linear scalar ODEs. If the ODEs are solved using a linear multistep method with a statically determined timestep, then each ODE solution corresponds to the solution of a banded lower triangular matrix equation, or, equivalently, a linear recurrence. Parallelizing linear recurrence equations has been studied extensively [8], [15], [21], [22], [29]. In particular, if a cyclic reduction approach is used to parallelize the linear recurrence, then parallelism is introduced without increasing the order of the serial complexity. For example, if a two-level method is used to discretize the scalar ODE, then a lower bidiagonal matrix equation must be solved. Cyclic reduction combines even-numbered equations with odd-numbered equations to generate a new bidiagonal matrix equation of half the size. If the original matrix equation has the form

(3)
    | a11                               | | x1 |   | f1 |
    | a21 a22                           | | x2 |   | f2 |
    |     a32 a33                       | | x3 |   | f3 |
    |         a43 a44                   | | x4 | = | f4 |
    |             a54 a55               | | x5 |   | f5 |
    |                 a65 a66           | | x6 |   | f6 |
    |                     a76 a77       | | x7 |   | f7 |
    |                         a87 a88   | | x8 |   | f8 |

then one step of the cyclic reduction algorithm generates a new matrix equation of the form

(4)
    | a22                                                 | | x2 |   | f2 - (a21/a11) f1 |
    | -(a43 a32)/a33  a44                                 | | x4 |   | f4 - (a43/a33) f3 |
    |                 -(a65 a54)/a55  a66                 | | x6 | = | f6 - (a65/a55) f5 |
    |                                 -(a87 a76)/a77  a88 | | x8 |   | f8 - (a87/a77) f7 |

The solution vector of the smaller system is identical to the even-numbered elements of the solution vector of the original matrix equation. This process is repeated until only one or two equations are left, at which time this small linear system is solved. The solution values for the small system are then used to calculate the unresolved values in the next larger linear system. By repeating this process, the solution to the original matrix equation is calculated. Note that each step of the reduction stage is perfectly parallel in the sense that combining each pair of equations is independent. Similarly, injecting solution values into the next larger system and solving for the unresolved variables can be done independently for each unresolved variable. Thus, cyclic reduction allows us to parallelize the time direction.
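For the two-level (bidiagonal) case, the reduction and back-substitution steps just described can be sketched in Python as follows. The function name and array conventions are our own (l[0] is taken as 0, since the first equation has no subdiagonal entry); this is an illustrative sketch, not code from the paper.

```python
import numpy as np

def cyclic_reduction(d, l, f):
    """Solve the lower bidiagonal system
        d[0]*x[0] = f[0],   l[j]*x[j-1] + d[j]*x[j] = f[j]   (j >= 1),
    by cyclic reduction: eliminate the odd-indexed unknowns to get a
    half-size bidiagonal system, recurse, then back-substitute."""
    n = len(d)
    if n == 1:
        return np.array([f[0] / d[0]])
    keep = np.arange(1, n, 2)          # unknowns kept in the reduced system
    prev = keep - 1                    # the unknowns eliminated from each pair
    m = l[keep] / d[prev]              # elimination multipliers
    d2 = d[keep].astype(float)
    l2 = np.zeros(len(keep))
    l2[1:] = -m[1:] * l[prev[1:]]      # new coupling, two places back
    f2 = f[keep] - m * f[prev]
    x = np.empty(n)
    x[keep] = cyclic_reduction(d2, l2, f2)
    x[0] = f[0] / d[0]                 # back-substitute the eliminated unknowns,
    ev = np.arange(2, n, 2)            # each one independently of the others
    x[ev] = (f[ev] - l[ev] * x[ev - 1]) / d[ev]
    return x

# Demo on a random 8-by-8 bidiagonal system, the size of equation (3).
rng = np.random.default_rng(0)
d = rng.uniform(1.0, 2.0, 8)
l = rng.uniform(-1.0, 1.0, 8); l[0] = 0.0
f = rng.uniform(-1.0, 1.0, 8)
x = cyclic_reduction(d, l, f)
```

Each reduction level and each back-substitution level touches its equations independently, which is exactly the parallelism exploited in the text.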

If a k-level linear multistep algorithm is used to solve the ODE, then the banded lower triangular matrix defining the linear recurrence has a bandwidth of k. The cyclic reduction algorithm again halves the number of equations at each step, but now k consecutive equations are needed to transform the dependencies in a given equation from the k - 1 previous values in the current matrix equation to the k - 1 previous values in the new smaller system. As before, this process is continued until only k or fewer equations are left. If k > 2, then solving this small system and injecting the solution back into the next larger system does not decouple the calculation of the unresolved variables. Instead, it produces a new banded lower triangular matrix equation to solve, one whose bandwidth is ⌈k/2⌉. Repeatedly applying the cyclic reduction algorithm continues to halve the bandwidth until all of the unresolved variables are calculated, at which time they can be injected into the next larger system to reduce its bandwidth.


The cyclic reduction algorithm is more expensive than the standard serial algorithm, but the complexity is still Θ(Nt) for each ODE solution (if k is independent of Nt). For example, for a two-level method, the serial complexity of the standard algorithm is 3Nt, while for the cyclic reduction algorithm it is 5Nt or 7Nt, depending on whether certain values are precomputed. For a three-level scheme, the complexity of the standard algorithm is 5Nt, compared with 11Nt or 21Nt for the cyclic reduction algorithm. As long as the complexity of the ODE solver is Θ(Nt), the multigrid waveform relaxation algorithm remains a Θ(Ns · Nt) complexity algorithm.

The parallel complexity of the cyclic reduction algorithm is a function of the number of time levels used in the discretization of the ODE. For a k-level scheme, it is Θ(log^γ Nt), where γ = ⌈k/2⌉. Including this cyclic reduction algorithm in a parallel multigrid waveform relaxation algorithm that uses a weighted Jacobi or red-black Gauss-Seidel smoother leads to a parallel complexity of Θ(log Ns · log^γ Nt) for a single V-cycle on a hierarchy of Θ(log Ns) multigrid levels. The total parallel complexity of the full multigrid algorithm with a constant number of V-cycles per grid level is then of the form Θ(log^2 Ns · log^γ Nt), which is polylog.

2.4. Multigrid waveform relaxation with parallel cyclic reduction. The algorithm described above, multigrid waveform relaxation with cyclic reduction, was motivated by the theoretical question introduced in §1. The following generalization is motivated by practical issues.

By imitating Hockney's parallel cyclic reduction algorithm (PARACR) [15], also called the recursive doubling method, we can lower the parallel complexity of the cyclic reduction algorithm to Θ(log Nt), independent of the length of the recurrence, without increasing the number of processors needed. The idea is to modify all equations at each reduction step. Returning to equation (3), PARACR transforms this linear system into two decoupled linear systems of half the size: system (4) and

(5)
    | a11                                                 | | x1 |   | f1                |
    | -(a32 a21)/a22  a33                                 | | x3 |   | f3 - (a32/a22) f2 |
    |                 -(a54 a43)/a44  a55                 | | x5 | = | f5 - (a54/a44) f4 |
    |                                 -(a76 a65)/a66  a77 | | x7 |   | f7 - (a76/a66) f6 |

Thus, after log2 Nt steps, there are Nt independent equations whose solutions solve the original problem. While the resulting algorithm has a serial complexity of Θ(Nt log Nt), this is unimportant on a multiprocessor with Θ(Nt) processors. On such multiprocessors, PARACR is substantially faster than the original cyclic reduction algorithm because it avoids the backsubstitution phase. It also eliminates the recursive halving of the bandwidth required in the elimination step when k > 2.
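The PARACR/recursive doubling scheme just described can be sketched for the bidiagonal case as follows. Every equation is combined with the equation s places above it and the span s doubles each step, so all unknowns decouple after ⌈log2 n⌉ steps. The function name and the l[0] = 0 convention are our own illustrative choices.

```python
import numpy as np

def paracr(d, l, f):
    """PARACR / recursive doubling for the bidiagonal recurrence
        d[0]*x[0] = f[0],   l[j]*x[j-1] + d[j]*x[j] = f[j]   (j >= 1).
    All equations are modified at each reduction step; no back-substitution
    phase is needed."""
    n = len(d)
    b = np.asarray(l, dtype=float).copy()   # b[j] couples x[j] to x[j-s]
    d = np.asarray(d, dtype=float).copy()   # diagonal is never modified
    g = np.asarray(f, dtype=float).copy()
    s = 1
    while s < n:
        j = np.arange(s, n)                 # equations still coupled upward
        m = b[j] / d[j - s]
        b2, g2 = b.copy(), g.copy()         # combine with old coefficients only
        b2[j] = -m * b[j - s]               # now couples to x[j-2s] (0 if gone)
        g2[j] = g[j] - m * g[j - s]
        b, g = b2, g2
        s *= 2
    return g / d                            # n fully decoupled 1-by-1 equations
```

Each step modifies all equations concurrently, matching the Θ(log Nt) parallel complexity (and Θ(Nt log Nt) serial work) stated above.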

3. Complexity and computational experiment. In this section, we verify the predicted parallel complexity of the multigrid waveform relaxation algorithm. We concentrate in particular on the PARACR or recursive doubling variant, which, within rounding error, provides the same numerical results as the original cyclic-reduction-based algorithm. Not only is the complexity of this variant smaller than that of the basic cyclic-reduction method, its implementation on a massively parallel multiprocessor is also more straightforward.

We consider a linear parabolic PDE with variable coefficients on a two-dimensional rectangular domain with Dirichlet boundary conditions. The problem is discretized in space on a regular rectangular grid with an equal number of grid lines in both spatial dimensions, leading to a five-point finite difference stencil. Backward Euler (BDF1), a two-level scheme,


the second order backward differentiation formula (BDF2), a three-level scheme, and the Crank-Nicolson method (CN), a two-level scheme, are used to discretize in time. The total number of grid points, not counting the ones with known values on the boundaries and on the initial time level, is given by Nx × Ny × Nt, with

(6)    Nx = Ny = 2^P - 1   and   Nt = 2^Q

for positive integer values P and Q. The algorithm employs a multigrid strategy where the grid hierarchy with P levels is defined by standard spatial coarsening from the fine grid down to a coarse grid with a single unknown per time level. Bilinear prolongation and standard full weighting are used as intergrid transfer operators. The smoother is a Gauss-Seidel waveform relaxation method with a red-black ordering of the grid points.

In Table 1 we present the parallel complexities of the multigrid waveform relaxation operators. (One processor per grid point is assumed.) Given are the number of parallel stages (each corresponding to a floating point operation) required in the application of a smoothing half-step (red or black) (CS), in the defect calculation (CD), in the restriction (CR), in the prolongation (CP), and in the correction (CC). Note that CD, CR, CP, and CC are completely independent of the size of the grid. For deriving CR we took into account that the defect after a red-black step is zero at half the number of grid points. The constant term in the formula for CS is the complexity of constructing the linear recurrence, whereas the linear term is the complexity of the parallel cyclic reduction algorithm.

TABLE 1
Parallel complexity of multigrid waveform relaxation operators.

    Method   CS(Q)      CD   CR   CP   CC
    BDF1     9 + 3Q     12   6    4    1
    BDF2     9 + 11Q    14   6    4    1
    CN       10 + 3Q    14   6    4    1

The complexity of a multigrid V-cycle, denoted by CV(P, Q), can be calculated from

(7)    CV(1, Q) = CS(Q),

(8)    CV(P, Q) = CV(P - 1, Q) + 2ν CS(Q) + CD + CR + CP + CC,

where ν denotes the total number of red-black smoothing steps per grid level. For BDF1, BDF2, and CN we obtain

(9)     CV(P, Q) = 6ν PQ + (23 + 18ν) P + (3 - 6ν) Q - 14 - 18ν,

(10)    CV(P, Q) = 22ν PQ + (25 + 18ν) P + (11 - 22ν) Q - 16 - 18ν,

(11)    CV(P, Q) = 6ν PQ + (25 + 20ν) P + (3 - 6ν) Q - 15 - 20ν,

respectively. From (7) and (8), one readily derives the parallel complexity of the full multigrid waveform relaxation algorithm. Let δ be the number of V-cycles per grid level (usually 1 or 2). Then

                                         P
(12)    CFMG(P, Q) = CI + CV(1, Q) +     Σ   (CI + CP + δ · CV(i, Q)),
                                        i=2

where CI is the Θ(1) cost of initialising certain variables before starting the actual multigrid computations on a new grid level. With (9), (10), and (11) this gives

(13)    CFMG(P, Q) = Θ(P^2 Q),


for BDF1, BDF2, and CN, establishing the predicted polylog parallel complexity. The coefficient of the leading order term is given by 3ν for BDF1 and CN, and by 11ν for BDF2.
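The V-cycle counts above can be checked mechanically. The following Python sketch (our own naming) verifies that the closed forms (9)-(11) agree with the recursion (7)-(8) and the operator counts of Table 1; the correction cost CC is taken as 1, the value that makes recursion and closed forms agree.

```python
# Closed forms (9)-(11) versus the recursion (7)-(8) with the Table 1 counts.
table = {
    # name: (C_S as a function of Q, C_D, closed form C_V(P, Q, nu))
    "BDF1": (lambda Q: 9 + 3 * Q,  12,
             lambda P, Q, nu: 6*nu*P*Q + (23 + 18*nu)*P + (3 - 6*nu)*Q - 14 - 18*nu),
    "BDF2": (lambda Q: 9 + 11 * Q, 14,
             lambda P, Q, nu: 22*nu*P*Q + (25 + 18*nu)*P + (11 - 22*nu)*Q - 16 - 18*nu),
    "CN":   (lambda Q: 10 + 3 * Q, 14,
             lambda P, Q, nu: 6*nu*P*Q + (25 + 20*nu)*P + (3 - 6*nu)*Q - 15 - 20*nu),
}
CR, CP, CC = 6, 4, 1                      # restriction, prolongation, correction

for name, (CS, CD, closed) in table.items():
    for nu in (1, 2, 3):
        for Q in range(8):
            cv = CS(Q)                    # (7): a single (coarsest) grid level
            for P in range(1, 8):
                assert cv == closed(P, Q, nu), name
                cv += 2 * nu * CS(Q) + CD + CR + CP + CC   # (8): add one level
```

Running the loop raises no assertion, i.e., the closed forms reproduce the recursion exactly over the sampled range of P, Q, and ν.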

The algorithm analysed above has been implemented on a CM/2 Connection Machine [14], [30], a massively parallel multiprocessor of SIMD (single instruction stream, multiple data stream) type. The results of a timing experiment performed on a machine with 16K processors are reported in Figs. 1 and 2. We present execution times obtained with the implementation of the BDF1 method. The curves for the BDF2 and CN schemes are qualitatively very similar.

[Figure omitted: execution times in seconds (roughly 0.05 to 0.45) plotted against Q, one curve per P = 1, ..., 5.]

FIG. 1. Multigrid waveform relaxation with parallel cyclic reduction, BDF1 V-cycle times.

Given in Fig. 1 are the execution times, in seconds, of multigrid V-cycles with 2 presmoothing steps and 1 postsmoothing step, for different values of P and Q. In all cases, P and Q are such that the total number of grid points on the finest grid is smaller than or equal to the number of processing elements available. The lines are connecting points that correspond to V-cycles on a fixed number of grid levels (P) for different numbers of time levels. The qualitative correctness of formula (9) is easily verified by inspection of the picture. For a fixed value of P the parallel complexity is a linear function of Q with a slope that is linearly dependent on P. A similar rule holds with P and Q interchanged. The latter is verified by looking at the difference between the execution times for successive values of P at a fixed value of Q, a difference that increases with growing values of Q.

In Fig. 2 we present the execution time of the full multigrid algorithm. The (problem dependent) value δ was set to one in this experiment. Also, we did not include the problem dependent cost of initialising the boundary values, initial condition values, and the PDE right-hand side. The curves clearly illustrate the parabolic dependence of the execution time on the number of grid levels, as predicted by (13).


[Figure omitted: full multigrid execution times in seconds (up to about 0.6) plotted against P = 1, ..., 7, one curve per Q = 0, 2, 4, 6.]

FIG. 2. Multigrid waveform relaxation with parallel cyclic reduction, BDF1 full multigrid times.

4. Extensions.

4.1. A coarse grain parallel algorithm. By imitating the blocked cyclic reduction algorithm discussed in [18] and [19], the communication cost can be reduced to a manageable size for distributed memory multiprocessors. The algorithm is based on a perfectly parallel substructuring or partitioning technique to reduce the original system to an intermediate system with a single equation per processor. For example, if Pt processors are assigned to the solution of an ODE, the blocked algorithm generates a small intermediate Pt × Pt linear system whose solution by (parallel) cyclic reduction introduces Pt-way parallelism into the rest of the calculation. This coarse grain parallel algorithm is very similar to the space-time concurrent waveform algorithm of [34], where a different method is used to solve the intermediate system.
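The substructuring idea can be sketched for the bidiagonal (two-level) case as follows. This is an illustrative Python sketch with our own naming and simplifications: the interface system is solved serially below, whereas the paper solves it by (parallel) cyclic reduction; phases 1 and 3 are perfectly parallel over blocks.

```python
import numpy as np

def blocked_solve(d, l, f, p):
    """Two-phase substructuring for the bidiagonal recurrence
        d[0]*x[0] = f[0],   l[j]*x[j-1] + d[j]*x[j] = f[j]   (j >= 1),
    split over p blocks (p must divide n in this sketch)."""
    n = len(d)
    m = n // p
    alpha = np.empty(p)        # x_last(b) = alpha[b] * x_last(b-1) + beta[b]
    beta = np.empty(p)
    for b in range(p):         # phase 1: each block reduces to one equation
        a_c, b_c = 1.0, 0.0    # running affine map from the block entry value
        for j in range(b * m, (b + 1) * m):
            a_c = -l[j] * a_c / d[j]          # l[0] = 0 closes the first block
            b_c = (f[j] - l[j] * b_c) / d[j]
        alpha[b], beta[b] = a_c, b_c
    last = np.empty(p)         # phase 2: small p-equation interface system
    last[0] = beta[0]          # (serial here; parallel cyclic reduction in the paper)
    for b in range(1, p):
        last[b] = alpha[b] * last[b - 1] + beta[b]
    x = np.empty(n)            # phase 3: recover block interiors independently
    for b in range(p):
        prev = last[b - 1] if b > 0 else 0.0
        for j in range(b * m, (b + 1) * m):
            prev = (f[j] - l[j] * prev) / d[j]
            x[j] = prev
    return x
```

With one block per processor, only the small phase-2 system involves interprocessor communication, which is the point of the blocked variant.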

4.2. Nonlinear problems and systems of parabolic problems. Nonlinear parabolic PDEs can be solved by using a nonlinear or full approximation scheme multigrid waveform method [31], [32]. These algorithms are straightforward extensions of the well-known multigrid methods for nonlinear elliptic problems. The nonlinearity is treated in the smoothing steps, where each nonlinear ODE is first linearized (as an ODE in a single unknown) and then solved by a standard ODE method for linear problems. Consequently, the serial and parallel complexities of the algorithms for nonlinear problems are similar to those of the algorithms for linear problems.

The algorithms for linear and nonlinear equations are readily extended to algorithms for solving systems of linear or nonlinear equations, by using a block cyclic reduction method to solve the set of ODEs at each grid point. Although the serial complexity of the block


cyclic reduction method may be many times larger than that of the standard serial algorithm, it remains a Θ(Nt) method. Also, the order of the parallel complexity is unaltered.

5. Concluding remarks. The algorithms described in this paper establish that major classes of parabolic PDEs can be solved in polylog parallel time without giving up linear serial complexity. Moreover, minor variants of these algorithms with log-linear serial complexity can be efficiently implemented on massively parallel SIMD multiprocessors.

Using cyclic reduction within the multigrid waveform relaxation algorithm has the desired properties for all problems for which (a) the full multigrid waveform relaxation algorithm has a linear serial complexity when using a relaxation technique that can be efficiently parallelized, like Jacobi or multicolor Gauss-Seidel, and (b) cyclic reduction is a stable algorithm for solving the linear recurrences arising from discretizing the scalar ODEs. Both theory and empirical evidence indicate that (a) is true for a large class of parabolic problems, both linear and nonlinear. While some work on the stability of parallelizing recurrence equations has been done (see [20], [29]) and the numerical examples computed by the authors give no indication of stability problems, more work must be done to establish the stability of this algorithm, in particular when the bandwidth of the recurrence is large and/or the system is not diagonally dominant. This is especially true given the stability problems of the obvious implementation of cyclic reduction for elliptic PDEs [7].

When properties (a) and (b) hold, we have shown the algorithm to have an achievable parallel complexity of Θ(log² Ns · γ log Nt), where γ = ⌈k/2⌉ when using cyclic reduction and γ = 1 when using recursive doubling. This is still higher than that for an elliptic problem with the same number of grid points, and the question arises as to whether it can be decreased to Θ(log² N) with N = Ns · Nt. The log Nt factor in the complexity of the multigrid waveform algorithm arises from the parallel complexity of the cyclic reduction algorithm. It could be avoided in two different ways: either by the use of an approximate solver for the lower triangular system with constant parallel complexity, or by the use of a strategy that coarsens the grid in time (as well as space) during the multigrid process. The former approach is the one followed in the parabolic multigrid algorithm with parallel smoothing by Hackbusch [13], where a Jacobi relaxation step is used to approximate the linear system solution. As mentioned in §1, this approach has some limitations with respect to mesh ratios, discretization schemes, and data smoothness assumptions. The latter approach is explored in a paper by Horton and Vandewalle [17]. Their algorithm is also not as generally applicable as the multigrid waveform method but performs satisfactorily in particular for the BDF1 and BDF2 discretizations of the time derivative.
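The recursive doubling alternative mentioned above can be sketched for a scalar first-order recurrence x[i] = a[i]*x[i-1] + f[i]: the affine steps compose associatively, so a parallel prefix scan over the pairs (a[i], f[i]) finishes in ⌈log₂ Nt⌉ stages. The Python below is a serial stand-in for illustration only; the inner loop is the part that would run in parallel.

```python
def solve_recurrence_doubling(a, f):
    """Recursive doubling (inclusive parallel-prefix scan) for the
    recurrence x[i] = a[i]*x[i-1] + f[i] with x[-1] = 0."""
    n = len(f)
    A = list(a)      # A[i], F[i] represent the composed map x -> A[i]*x + F[i]
    F = list(f)
    d = 1
    while d < n:     # ceil(log2(n)) doubling stages
        A_new, F_new = A[:], F[:]
        for i in range(d, n):          # parallel loop in a real implementation
            # compose the chain ending at i with the chain ending at i - d
            A_new[i] = A[i] * A[i - d]
            F_new[i] = A[i] * F[i - d] + F[i]
        A, F = A_new, F_new
        d *= 2
    # after the scan, x[i] = A[i]*x[-1] + F[i]; with x[-1] = 0 this is F[i]
    return F
```

Each stage does Θ(Nt) work, so the serial complexity grows to Θ(Nt log Nt) in exchange for the Θ(log Nt) parallel depth, which is the log-linear trade-off noted in the concluding remarks.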

Finally, beyond answering the theoretical question of §1, the multigrid waveform algorithm has a clear promise as a practical parallel algorithm, since it is a competitive serial algorithm for many applications. It is realised, however, that adaptive time-step selection and spatial grid adaptivity are crucial to the success of good solvers for the often highly nonlinear parabolic problems that arise in practical applications. Although the waveform relaxation idea allows in principle the use of local step sizes varying in different regions of the computational domain, and the full multigrid algorithm enables local error estimation based on the availability of solutions on different grid levels, it is not clear how this will affect the actual parallel performance of the algorithm.

Acknowledgments. The authors would like to thank Thinking Machines Corporation, Cambridge, MA, for providing access to the Connection Machine Network Server pilot facility, and the Gesellschaft für Mathematik und Datenverarbeitung, Sankt-Augustin, Germany, for letting them use the CM-2 machine at the GMD.



REFERENCES

[1] P. BASTIAN, J. BURMEISTER, AND G. HORTON, Implementation of a parallel multigrid method for parabolic partial differential equations, in Parallel Algorithms for PDEs, Proc. 6th GAMM Seminar Kiel, January 19-21, 1990, W. Hackbusch, ed., Vieweg Verlag, Wiesbaden, 1990, pp. 18-27.

[2] A. BELLEN AND M. ZENNARO, Parallel algorithms for initial value problems for difference and differential equations, J. Comput. Appl. Math., 25 (1984), pp. 341-353.

[3] A. BRANDT, Multigrid solvers on parallel computers, in Elliptic Problem Solvers, M. Schultz, ed., Academic Press, New York, 1981, pp. 39-83.

[4] J. BURMEISTER, Paralleles Lösen diskreter parabolischer Probleme mit Mehrgittertechniken, Diplomarbeit, Universität Kiel, 1985.

[5] J. BURMEISTER AND G. HORTON, Time-parallel multigrid solution of the Navier-Stokes equations, in Multigrid Methods III, Proc. 3rd European Multigrid Conference, Bonn, 1990, W. Hackbusch and U. Trottenberg, eds., Birkhäuser Verlag, Basel, 1991, pp. 155-166.

[6] K. BURRAGE, Parallel methods for initial value problems, Appl. Numer. Math., 11 (1993), pp. 5-25.

[7] B. BUZBEE, G. GOLUB, AND C. NIELSON, On direct methods for solving Poisson's equations, SIAM J. Numer. Anal., 7 (1970), pp. 637-657.

[8] S. CHEN AND D. KUCK, Time and parallel processor bounds for linear recurrence systems, IEEE Trans. Comput., C-24 (1975), pp. 701-717.

[9] N. DECKER, Note on the parallel efficiency of the Frederickson-McBryan multigrid algorithm, SIAM J. Sci. Statist. Comput., 12 (1991), pp. 208-220.

[10] H. EISSFELLER AND S. MÜLLER, The triangle method for saving startup time in parallel computers, in Proc. 5th Distributed Memory Computing Conference, Los Alamitos, CA, 1990, D. Walker and Q. Stout, eds., IEEE Computer Society Press, Washington, D.C., pp. 568-572.

[11] C. GEAR, Waveform methods for space and time parallelism, J. Comput. Appl. Math., 38 (1991), pp. 137-147.

[12] C. GEAR AND X. XUHAI, Parallelism across time in ODEs, Appl. Numer. Math., 11 (1993), pp. 45-68.

[13] W. HACKBUSCH, Parabolic multigrid methods, in Computing Methods in Applied Sciences and Engineering VI, R. Glowinski and J.-L. Lions, eds., North-Holland, Amsterdam, 1984, pp. 189-197.

[14] D. HILLIS, The Connection Machine, MIT Press, Cambridge, MA, 1985.

[15] R. HOCKNEY AND C. JESSHOPE, Parallel Computers: Architecture, Programming, and Algorithms, Adam Hilger Ltd., Bristol, UK, 1981.

[16] G. HORTON, The time-parallel multigrid method, Comm. Appl. Numer. Methods, 8 (1992), pp. 585-595.

[17] G. HORTON AND S. VANDEWALLE, A space-time multigrid method for parabolic P.D.E.s, Tech. report 6/93, IMMD 3, Universität Erlangen-Nürnberg, June 1993.

[18] S. JOHNSSON, Solving tridiagonal systems on ensemble architectures, SIAM J. Sci. Statist. Comput., 8 (1987), pp. 354-392.

[19] S. JOHNSSON, Y. SAAD, AND M. SCHULTZ, Alternating direction methods on multiprocessors, SIAM J. Sci. Statist. Comput., 8 (1987), pp. 686-700.

[20] P. KOGGE, The numerical stability of parallel algorithms for solving recurrence problems, Tech. report 44, Digital Systems Laboratory, Stanford Electronics Laboratories, Stanford University, Stanford, CA, Sept. 1972.

[21] P. KOGGE, Parallel solution of recurrence problems, IBM J. Res. Develop., 18 (1974), pp. 138-148.

[22] P. KOGGE AND H. STONE, A parallel algorithm for the efficient solution of a general class of recurrence equations, IEEE Trans. Comput., C-22 (1973), pp. 786-793.

[23] E. LELARASMEE, A. RUEHLI, AND A. SANGIOVANNI-VINCENTELLI, The waveform relaxation method for time-domain analysis of large scale integrated circuits, IEEE Trans. Computer-Aided Design, 1 (1982), pp. 131-145.

[24] C. LUBICH AND A. OSTERMANN, Multigrid dynamic iteration for parabolic equations, BIT, 27 (1987), pp. 216-234.

[25] O. MCBRYAN, P. FREDERICKSON, J. LINDEN, A. SCHÜLLER, K. SOLCHENBACH, K. STÜBEN, C. THOLE, AND U. TROTTENBERG, Multigrid methods on parallel computers -- a survey of recent developments, Impact Comput. Sci. Engrg., 3 (1991), pp. 1-75.

[26] U. MIEKKALA AND O. NEVANLINNA, Convergence of dynamic iteration methods for initial value problems, SIAM J. Sci. Statist. Comput., 8 (1987), pp. 459-482.

[27] A. NEWTON AND A. SANGIOVANNI-VINCENTELLI, Relaxation-based electrical simulation, SIAM J. Sci. Statist. Comput., 4 (1983), pp. 485-524.

[28] J. SALTZ, V. NAIK, AND D. NICOL, Reduction of the effects of the communication delays in scientific algorithms on message passing MIMD architectures, SIAM J. Sci. Statist. Comput., 8 (1987), pp. s118-s138.

[29] A. SAMEH AND R. BRENT, Solving triangular systems on a parallel computer, SIAM J. Numer. Anal., 14 (1977), pp. 1101-1113.

[30] THINKING MACHINES CORPORATION, Connection Machine CM-200 Series, Technical Summary, 245 First Street, Cambridge, MA 02142-1264, June 1991.

[31] S. VANDEWALLE, Parallel Multigrid Waveform Relaxation for Parabolic Problems, B. G. Teubner Verlag, Stuttgart, 1993.

[32] S. VANDEWALLE AND R. PIESSENS, Numerical experiments with nonlinear multigrid waveform relaxation on a parallel processor, Appl. Numer. Math., 8 (1991), pp. 149-161.

[33] S. VANDEWALLE AND R. PIESSENS, Efficient parallel algorithms for solving initial-boundary value and time-periodic parabolic partial differential equations, SIAM J. Sci. Statist. Comput., 13 (1992), pp. 1330-1346.

[34] S. VANDEWALLE AND E. VAN DE VELDE, Space-time concurrent multigrid waveform relaxation, Ann. Numer. Math., 1 (1994), pp. 335-346.

[35] D. WOMBLE, A time-stepping algorithm for parallel computers, SIAM J. Sci. Statist. Comput., 11 (1990), pp. 824-837.

[36] P. WORLEY, Information Requirements and the Implications for Parallel Computation, Ph.D. thesis, Stanford University, Stanford, CA, June 1988.

[37] P. WORLEY, The effect of time constraints on scaled speedup, SIAM J. Sci. Statist. Comput., 11 (1990), pp. 838-858.

[38] P. WORLEY, Limits on parallelism in the numerical solution of linear PDEs, SIAM J. Sci. Statist. Comput., 12 (1991), pp. 1-35.

[39] P. WORLEY, Parallelizing across time when solving time-dependent partial differential equations, Tech. report ORNL/TM-11963, Oak Ridge National Laboratory, Oak Ridge, TN, Sept. 1991.

[40] P. WORLEY, The effect of multiprocessor radius on scaling, Parallel Comput., 18 (1992), pp. 361-376.

[41] P. WORLEY, Parallelizing across time when solving time-dependent partial differential equations, in Proc. 5th SIAM Conference on Parallel Processing for Scientific Computing, J. Dongarra, K. Kennedy, P. Messina, D. Sorensen, and R. Voigt, eds., Society for Industrial and Applied Mathematics, Philadelphia, PA, 1992, pp. 246-252.