
Page 1

[16] Keckler, S. W. & Dally, W. J. Processor coupling: integrating compile time and run time scheduling for parallelism. Proc. ACM 19th Int'l. Symp. on Comp. Arch., Queensland, Australia (1992) 202-213.

[17] Ni, L. M., Xu, C. & Gendreau, T. B. A distributed drafting algorithm for load balancing. IEEE Trans. Soft. Eng. SE-11 (1985).

[18] Nicol, D. M. & Saltz, J. H. Dynamic remapping of parallel computations with varying resource demands. IEEE Trans. Comp. 37 (1988).

[19] Noakes, M. & Dally, W. J. System design of the J-machine. Proc. 6th MIT Conf. on Advanced Research in VLSI. Dally, W. J. (Ed.). MIT Press, Cambridge, MA, 1990, pp. 179-194.

[20] Pothen, A., Simon, H. D. & Liou, K. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. 11 (1990) 430-452.

[21] Seitz, C. L. Mosaic C: an experimental fine-grain multicomputer. Proc. International Conference Celebrating the 25th Anniversary of INRIA, Paris, France, December 1992. Springer-Verlag, New York, NY, 1992.

[22] Shen, S. Cooperative distributed dynamic load balancing. Acta Informatica 25 (1988) 663-676.

[23] Wang, J. C. T. & Taylor, S. A concurrent Navier-Stokes solver for implicit multibody calculations. To appear in Proc. Parallel-CFD '93, Paris, France, June 1993.

[24] Rowell, J. Lessons learned on the Delta. High Perf. Comput. Rev. 1 (1993) 21-24.

Page 2

[5] Brugé, F. & Fornili, S. L. A distributed dynamic load balancer and its implementation on multi-transputer systems for molecular dynamics simulation. Comp. Phys. Comm. 60 (1990) 39-45.

[6] Cybenko, G. Dynamic load balancing for distributed memory multiprocessors. J. Parallel Distrib. Comput. 7 (1989) 279-301.

[7] Horn, R. A. & Johnson, C. R. Matrix Analysis. Cambridge University Press, New York, NY, 1991.

[8] Golub, G. H. & Van Loan, C. F. Matrix Computations (2nd edition). The Johns Hopkins University Press, Baltimore, MD, 1989.

[9] Gustafson, J. L. Reevaluating Amdahl's law. Comm. ACM 31 (1988) 532-533.

[10] Hofstee, H. P., Lukkien, J. J. & van de Snepscheut, J. L. A. A distributed implementation of a task pool. Research Directions in High Level Parallel Programming Languages, Banatre, J. P. & Le Metayer, D. (eds.), Lecture Notes in Computer Science 574 (1992) 338-348, Springer: New York.

[11] Horton, G. A multi-level diffusion method for dynamic load balancing. Parallel Comp. 19 (1993) 209-218.

[12] Kumar, V., Ananth, G. Y. & Rao, V. N. Scalable load balancing techniques for parallel computers. Preprint 92-021, Army High Performance Computing Research Center, Minneapolis, MN, 1992.

[13] Lin, F. C. H. & Keller, R. M. The gradient model load balancing method. IEEE Trans. Soft. Eng. SE-13 (1987) 32-38.

[14] McCormick, S. F. Multigrid Methods. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1987.

[15] Marinescu, D. C. & Rice, J. R. Synchronization and load imbalance effects in distributed memory multi-processor systems. Concurrency: Pract. Exp. 3 (1991) 593-625.

Page 3

Repeated application of Lemma 1 to equation (25) yields

$$1 = c_{i,j,k}^2 \,\frac{1}{8} \sum_{x,y,z} \left(1 + \cos\frac{4\pi yj}{n^{1/3}}\right)\left(1 + \cos\frac{4\pi zk}{n^{1/3}}\right)$$
$$= c_{i,j,k}^2 \,\frac{1}{8} \left[\sum_{x,y,z} \left(1 + \cos\frac{4\pi zk}{n^{1/3}}\right) + \sum_{x,y,z} \cos\frac{4\pi yj}{n^{1/3}}\left(1 + \cos\frac{4\pi zk}{n^{1/3}}\right)\right]$$
$$= c_{i,j,k}^2 \,\frac{1}{8} \left[\sum_{x,y,z} 1 + \sum_{x,y,z} \cos\frac{4\pi zk}{n^{1/3}}\right]$$
$$= c_{i,j,k}^2 \,\frac{n}{8}$$    (26)

from which we conclude $c_{i,j,k} = \left(\frac{8}{n}\right)^{1/2}$ for all $i, j, k$.

8 Acknowledgements

We thank Michael Noakes for a concurrent implementation of the iteration (2) in J-Machine assembler. We are grateful to Andrew Conley for helpful discussions and criticisms.

References

[1] Amdahl, G. Validity of the single processor approach to achieving large scale computing capabilities. Proc. AFIPS Comput. Conf. 30 (1967) 483-485.

[2] Athas, W. C. & Seitz, C. L. Multicomputers: message passing concurrent computers. IEEE Comp. 21 (1988) 9-24.

[3] Barnard, S. T. & Simon, H. D. A fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems. To appear in Concurrency: Pract. Exp. (1993).

[4] Boillat, J. E. Load balancing and Poisson equation in a graph. Concurrency: Pract. Exp. 2 (1990) 289-313.

Page 4

This section demonstrates that the eigenvector normalization constant $c_{i,j,k}$ is equal to $(8/n)^{1/2}$ for all eigenvectors $\vec{x}_{i,j,k}$. From (16) a necessary condition for a normalized eigenvector is

$$1 = c_{i,j,k}^2 \sum_{x,y,z} \cos^2\!\left(\frac{2\pi xi}{n^{1/3}}\right) \cos^2\!\left(\frac{2\pi yj}{n^{1/3}}\right) \cos^2\!\left(\frac{2\pi zk}{n^{1/3}}\right)$$
$$= c_{i,j,k}^2 \,\frac{1}{8} \sum_{x,y,z} \left(1 + \cos\frac{4\pi xi}{n^{1/3}}\right)\left(1 + \cos\frac{4\pi yj}{n^{1/3}}\right)\left(1 + \cos\frac{4\pi zk}{n^{1/3}}\right)$$
$$= c_{i,j,k}^2 \,\frac{1}{8} \left[\sum_{x,y,z} \left(1 + \cos\frac{4\pi yj}{n^{1/3}}\right)\left(1 + \cos\frac{4\pi zk}{n^{1/3}}\right) + \sum_{x} \cos\!\left(\frac{4\pi xi}{n^{1/3}}\right) \sum_{y,z} \left(1 + \cos\frac{4\pi yj}{n^{1/3}}\right)\left(1 + \cos\frac{4\pi zk}{n^{1/3}}\right)\right]$$    (25)

Simplify the preceding expression by the following

Lemma 1   $\displaystyle\sum_{x} \cos\!\left(\frac{4\pi xi}{n^{1/3}}\right) = 0$

Proof:

$$\sum_{x} \cos\!\left(\frac{4\pi xi}{n^{1/3}}\right) = \sum_{x} \mathrm{Re}\left(e^{\imath\, 4\pi xi/n^{1/3}}\right) = \mathrm{Re} \sum_{x} e^{\imath\, 4\pi xi/n^{1/3}} = \mathrm{Re} \sum_{x} \left(e^{\imath\, 4\pi i/n^{1/3}}\right)^{x}$$
$$= \mathrm{Re}\left[ e^{\imath\, 4\pi i/n^{1/3}} \;\frac{1 - \left(e^{\imath\, 4\pi i/n^{1/3}}\right)^{n^{1/3}}}{1 - e^{\imath\, 4\pi i/n^{1/3}}} \right] = 0$$

since $\left(e^{\imath\, 4\pi i/n^{1/3}}\right)^{n^{1/3}} = e^{\imath\, 4\pi i} = 1$. Q.E.D.
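As a quick numerical illustration of Lemma 1 and of the normalization sum above (this sketch is not part of the paper; it assumes indices i, j, k strictly between 0 and n^(1/3)/2, where the closed form applies directly, and all names in it are illustrative):

import numpy as np

N = 8                        # N = n**(1/3); n = N**3 processors
n = N ** 3
xs = np.arange(N)

# Lemma 1: sum over x of cos(4*pi*x*i / N) vanishes for i = 1 .. N/2 - 1
for i in range(1, N // 2):
    print(i, np.sum(np.cos(4 * np.pi * xs * i / N)))   # ~0 up to rounding

# (25)-(26): the triple sum of squared cosines equals n/8,
# so c = (8/n)**0.5 normalizes the eigenvector.
i, j, k = 1, 2, 3
s = sum(np.cos(2*np.pi*x*i/N)**2 * np.cos(2*np.pi*y*j/N)**2 * np.cos(2*np.pi*z*k/N)**2
        for x in xs for y in xs for z in xs)
print(s, n / 8)              # the two values agree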

Page 5

$$\quad + \left(\frac{u(\cdot,\cdot,z+dz,\cdot) + u(\cdot,\cdot,z-dz,\cdot) - 2u(\cdot,\cdot,\cdot,\cdot)}{dz^2}\right) + O(dt, dx^2, dy^2, dz^2)$$

From (21) the finite difference formulation is

$$\frac{u(x,y,z,t+dt) - u(x,y,z,t)}{dt} = \frac{1}{dx^2}\bigl(u(x+dx,y,z,t) + u(x-dx,y,z,t) + u(x,y+dy,z,t) + u(x,y-dy,z,t) + u(x,y,z+dz,t) + u(x,y,z-dz,t) - 6u(x,y,z,t)\bigr)$$

Setting $\alpha = dt/dx^2$ and taking the spatial terms on the right at time $t+dt$ yields an unconditionally stable implicit scheme

$$u(x,y,z,t) = (1+6\alpha)\,u(x,y,z,t+dt) - \alpha\,[u(x+dx,y,z,t+dt) + u(x-dx,y,z,t+dt) + u(x,y+dy,z,t+dt) + u(x,y-dy,z,t+dt) + u(x,y,z+dz,t+dt) + u(x,y,z-dz,t+dt)]$$    (22)

JACOBI ITERATION TO COMPUTE $A^{-1}u$

In order to compute solutions at successive time intervals $dt$ we must invert the relationship $u^{(t)} = Au^{(t+dt)}$ by solving

$$u^{(t+dt)} = A^{-1}u^{(t)}$$    (23)

From (22) it is apparent that $A$ has diagonal terms $(1+6\alpha)$ and six off-diagonals $-\alpha$. Let $A = (D - T)$ where $D$ is this diagonal. Then $(D-T)u^{(t+dt)} = u^{(t)}$ implies $u^{(t+dt)} = D^{-1}Tu^{(t+dt)} + D^{-1}u^{(t)}$. This relation is satisfied by fixed points of the Jacobi iteration

$$\left[u^{(t+dt)}\right]^{(m)} = D^{-1}T\left[u^{(t+dt)}\right]^{(m-1)} + D^{-1}u^{(t)}$$    (24)

The matrix $D^{-1}T$ has a zero diagonal and six off-diagonal terms $\frac{\alpha}{1+6\alpha}$. $D^{-1}$ is a diagonal matrix with terms $\frac{1}{1+6\alpha}$. The iteration (24) is the central loop of the method (2). Note that this iteration is everywhere convergent with spectral radius defined by (3).

UNIT IMPULSE DERIVATION

Page 6

presently considering the costs associated with such iterations.

7 Appendices

This appendix presents technical justifications for statements relied upon in the analysis.

FINITE DIFFERENCE FORMULATION OF THE HEAT EQUATION

Consider the parabolic heat equation in three dimensions

$$u_t = \nabla^2 u = u_{xx} + u_{yy} + u_{zz}$$    (21)

Taylor expanding in $t$ with all derivatives evaluated at $(x,y,z,t)$

$$u(x,y,z,t+dt) = u(x,y,z,t) + u_t\,dt + O(dt^2)$$
$$u_t = \left(\frac{u(x,y,z,t+dt) - u(x,y,z,t)}{dt}\right) + O(dt)$$

Obtain the second order terms by expanding in spatial variables where omitted coordinates are interpreted as $(x,y,z,t)$

$$u(x+dx,\cdot,\cdot,\cdot) = u(x,\cdot,\cdot,\cdot) + u_x\,dx + \frac{u_{xx}\,dx^2}{2} + \frac{u_{xxx}\,dx^3}{6} + O(dx^4)$$
$$u(x-dx,\cdot,\cdot,\cdot) = u(x,\cdot,\cdot,\cdot) - u_x\,dx + \frac{u_{xx}\,dx^2}{2} - \frac{u_{xxx}\,dx^3}{6} + O(dx^4)$$
$$u(x+dx,\cdot,\cdot,\cdot) + u(x-dx,\cdot,\cdot,\cdot) = 2u(x,\cdot,\cdot,\cdot) + u_{xx}\,dx^2 + O(dx^4)$$
$$u_{xx} = \left(\frac{u(x+dx,\cdot,\cdot,\cdot) + u(x-dx,\cdot,\cdot,\cdot) - 2u(x,\cdot,\cdot,\cdot)}{dx^2}\right) + O(dx^2)$$

Similar expansions in $y, z$ show that the heat equation can be rewritten

$$\frac{u(\cdot,\cdot,\cdot,t+dt) - u(\cdot,\cdot,\cdot,t)}{dt} = \left(\frac{u(x+dx,\cdot,\cdot,\cdot) + u(x-dx,\cdot,\cdot,\cdot) - 2u(\cdot,\cdot,\cdot,\cdot)}{dx^2}\right) + \left(\frac{u(\cdot,y+dy,\cdot,\cdot) + u(\cdot,y-dy,\cdot,\cdot) - 2u(\cdot,\cdot,\cdot,\cdot)}{dy^2}\right)$$
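A quick numerical check of the second order accuracy of the central difference approximation above (an illustration added here, not part of the paper): halving dx should cut the error in the approximation to u_xx by roughly a factor of four.

import numpy as np

x = 1.0
for dx in [0.1, 0.05, 0.025]:
    # central difference approximation to u_xx for u(x) = sin(x), whose exact u_xx is -sin(x)
    approx = (np.sin(x + dx) + np.sin(x - dx) - 2 * np.sin(x)) / dx**2
    print(dx, abs(approx - (-np.sin(x))))   # error shrinks ~4x each time dx is halved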

Page 7

generated contiguously this is presumably a simple matter because adjacent points will have data structures that identify their neighbors. In more general problems where the data is not already ordered the use of priority queues appears promising due to their O(n log n) complexity.

It is worth noting that the method can be used to rebalance a local portion of a computational domain without interrupting the computation which is occurring on the rest of the domain. This can be useful in CFD problems where some portions of the domain converge more quickly than others and adaptation might occur locally and frequently.

This paper presented the method and analysis for a domain which has periodic boundary conditions and is logically spherical. In practice multicomputer meshes are rarely periodic. In our simulations we implemented aperiodic boundaries by imposing the Neumann condition $\partial u/\partial x = 0$ in each direction $x, y, z$. This requires a simple modification to the iteration (2) so that processors immediately outside the mesh appear to have the same workload as processors one step inside the mesh. For example, if the processors are indexed from 1 through $n$ in the $x$ dimension then $u_{0,y,z} = u_{2,y,z}$ and $u_{n+1,y,z} = u_{n-1,y,z}$.

The algorithm is presented for three dimensional scalable multicomputers. It reduces to two dimensional cases by redefining $\beta$ and the iteration (2) as follows (a small sketch of this variant follows below):

$$\beta = \left\lceil \frac{\ln \epsilon}{\ln \frac{4\alpha}{1+4\alpha}} \right\rceil$$

$$u^{(m)}_{x,y} = \frac{u^{(0)}_{x,y}}{1 + 4\alpha} + \frac{\alpha}{1+4\alpha}\left(u^{(m-1)}_{x+1,y} + u^{(m-1)}_{x-1,y} + u^{(m-1)}_{x,y+1} + u^{(m-1)}_{x,y-1}\right)$$

The worst case behavior of the method is determined by disturbances of low spatial frequency. This is the basis of the objections in [11] and a conventional response would be to apply a multigrid method [11, 14]. Such a method can have logarithmic scaling in n and O(n) convergence. The analysis presented in this paper demonstrates that wall clock times can actually decrease as n increases and this suggests considering other methods of treating the worst case disturbance. One such method would be to use very large time steps in order to accelerate convergence of the low frequency components. The unconditional stability of this method makes this an attractive option. Although this would increase the error in the high frequency components these components can be quickly corrected by local iterations. We are
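The following sketch (an illustration, not the authors' code; function name and parameter values are assumptions) applies the two dimensional iteration above on an aperiodic mesh, imposing the Neumann condition by reflecting ghost cells so that the value just outside the mesh mirrors the value one step inside, as described above.

import numpy as np

def rebalance_2d(u, alpha=0.1, beta=3):
    # One exchange step of the two dimensional variant with reflecting (Neumann) boundaries.
    u0, um = u.copy(), u.copy()
    for _ in range(beta):
        g = np.pad(um, 1, mode="reflect")       # ghost cells: u_0 = u_2 and u_{n+1} = u_{n-1}
        nbrs = g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
        um = (u0 + alpha * nbrs) / (1 + 4 * alpha)
    return um                                    # target loads; hardware would exchange work to reach them

# usage: a point disturbance on a 16 x 16 processor mesh
u = np.zeros((16, 16))
u[0, 0] = 1.0e6
for step in range(70):
    u = rebalance_2d(u)
print(u.max() / u.mean())                        # worst case load over the average shrinks toward 1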

Page 8

To be useful in practical contexts the method must be able to rebalance disturbances faster than they arise. This simulation (figure 5) demonstrates the behavior of the method on a 1,000,000 processor J-machine under demanding conditions. An initially balanced distribution is disrupted repeatedly by large injections of work at random locations. Injection magnitudes are uniformly distributed between 0 and 60,000 times the initial load average. The simulation alternates repetitions of the algorithm with injections at randomly chosen locations. After 700 repetitions and injections the worst case discrepancy was 15,737 times the initial load average. This is less than the average injection magnitude of 30,000 at each repetition. This demonstrates the method was balancing the load faster than the injections were disrupting it. The last random injection occurred on step 700. After 100 additional exchange steps without intervening injections the worst case discrepancy had reduced from 15,737 to 50 times the initial load average.

6 Summary and Discussion

This paper has demonstrated that a diffusion based load balancing method is efficient, reliable, and scalable. It has shown through rigorous theory and empirical simulation that the method is inexpensive for important generic problems in CFD. This section discusses a few remaining points and indicates future directions for this research.

An important property of the method is its ability to preserve adjacency relationships among elements of a computational domain. Preserving adjacency permits CFD calculations to minimize their communication costs. The method preserves adjacency if it does so at each exchange step.

To make this requirement concrete consider an unstructured computational grid for a CFD calculation. When the time comes for the load balancing method to select grid points to exchange with neighboring processors it selects points in such a way that the average pairwise distance among all points is minimal. One way to do this is to assume that each processor represents a volume of the computational domain and to select for exchange those grid points which occupy the exterior of the volume. The selected points would transfer to adjacent volumes where their neighbors in the computational grid already reside. In order for this to be practical it must be inexpensive to identify these exterior points. In problems where the computational grid has been

Page 9

Figure 5: Disturbances resulting from rapid injection of large random loads on a million processor J-machine. After each exchange step a point disturbance is introduced at a randomly chosen processor. The average value of each point disturbance is 30,000 times the initial system load average. The interval between successive frames is 100 exchange steps. After 700 injections the worst case discrepancy was 15,737 times the initial load average. This demonstrates the algorithm was balancing the load faster than disturbances were created. After load injection ceased an additional 100 repetitions with no new disturbance reduced the worst case discrepancy from 15,737 to 50 times the initial load average.

ing the adjacency constraint at each exchange step. The initial disturbance was reduced by 90% after 6 exchange steps in exact agreement with theory ($\tau(0.1, 512)$ in table 1). After 59 exchange steps the worst case discrepancy was 999 grid points. After 162 steps the worst case discrepancy was 200 grid points, 10% of the load average. A balance within 1 grid point was achieved on the 500th step.

5.3 Random Load Injection
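A scaled-down sketch of this kind of experiment (an illustration only; the mesh size, parameter values, and injection magnitudes below are assumptions, not the million processor configuration, and the function names are mine):

import numpy as np

rng = np.random.default_rng(0)

def exchange_step(u, alpha=0.1, beta=3):
    # One exchange step of iteration (2) on a periodic 3D mesh, simulated on an array.
    u0, um = u.copy(), u.copy()
    for _ in range(beta):
        nbrs = sum(np.roll(um, s, axis=a) for a in range(3) for s in (1, -1))
        um = (u0 + alpha * nbrs) / (1 + 6 * alpha)
    return um

N = 16                                      # 16**3 = 4,096 simulated processors
u = np.full((N, N, N), 100.0)               # initially balanced load
for step in range(700):                     # alternate exchange steps with random injections
    u = exchange_step(u)
    p = tuple(rng.integers(0, N, size=3))
    u[p] += rng.uniform(0.0, 60000.0)       # a large injection, up to 600x the initial load average
for step in range(100):                     # drain the remaining imbalance with no new injections
    u = exchange_step(u)
print((u.max() - u.mean()) / 100.0)         # worst case discrepancy in units of the initial load average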

Page 10

Figure 4: Disturbance during an initial load of a million point unstructured computational grid onto a 512 processor J-machine. The first frame represents the entire grid assigned to a host node on the multicomputer. This is a point disturbance and the resulting behavior is in exact agreement with the analysis presented earlier in this paper. Subsequent frames are separated by 10 exchange steps. After 70 exchange steps the workload is already roughly balanced. A balance within 1 grid point was achieved after 500 exchange steps.

5.2 Partitioning an Unstructured Grid

In parallel CFD applications the static load balancing problem has been the subject of recent attention [3, 20]. In addition to finding an equitable distribution of work this problem must observe the additional constraint of preserving adjacency relationships among elements of an unstructured computational grid. The load balancing method presented in this paper can satisfy the adjacency constraint if at each exchange step it selects exchange candidates that observe the constraint.

This simulation models an initial point disturbance by assigning 1,000,000 points of an unstructured computational grid to a single host node of a 512 processor J-machine (see figure 4). It executes the algorithm while observ-

Page 11

Figure 3: Disturbance following a bow shock adaptation of a computational grid on a million processor J-machine. The first frame is the initial disturbance resulting from the adaptation. Subsequent frames are separated by 10 exchange steps. The disturbance is reduced dramatically by the second frame. After 70 exchange steps only weak low frequency components remain.

Page 12

5 Simulations

This section presents simulations of three cases in which the method is useful. The first case is a bow shock resulting from a million grid point CFD calculation [23]. The disturbance dissipates rapidly and is nearly gone after 10 exchange steps. The second case computes an initial load distribution for a million point unstructured grid problem on a 512 node multicomputer. The simulation suggests the method may be highly competitive with Lanczos based approaches presented recently in [3, 20]. The third case simulates a multicomputer operating system under conditions of random load injection. This case demonstrates that the method can effectively balance large disturbances which occur frequently and randomly.

Parameters for the simulations are based on two scalable multicomputers. The first is a 512 node J-machine [19] which is in use at Caltech for research in CFD and scalable concurrent computing. The second is a hypothetical 1,000,000 node J-machine. These two design points bracket a continuum of scalable multicomputers. The method is practical at both ends of this continuum and presumably at all points in between.

Wall clock times are based on a hand coded implementation of the method in J-machine assembler and assume 32 MHz processors. Each repetition of the method requires 110 instruction cycles in 3.4375 μs. All simulations are run with α = 0.1 and β = 3, resulting in 3 iterations between each exchange of work.

5.1 Bow Shock Adaptation

In CFD calculations it is common to adapt a computational grid in response to properties of a developing solution. This simulation considers an adaptation that results from a bow shock in front of a Titan IV launch vehicle with two boosters (figure 3). The grid has been adapted by doubling the density of points in each area of the bow shock. As a result the initial disturbance shows locations in the multicomputer where the workload has increased by 100% due to the introduction of new points. After 10 exchanges of work the imbalance has already decreased dramatically. After 70 exchanges the disturbance is scarcely visible. This simulation assumes a 1,000,000 processor J-machine. This example illustrates the weak persistence of low spatial frequencies.
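The per-repetition wall clock interval quoted above follows directly from the cycle count and the clock rate; the arithmetic below is a minimal check of those figures (an illustration added here, not the authors' code).

cycles_per_repetition = 110
clock_hz = 32e6                              # 32 MHz J-machine processors
step_us = cycles_per_repetition / clock_hz * 1e6
print(step_us)                               # 3.4375 microseconds per repetition
print(6 * step_us)                           # 20.625 microseconds for the 6 exchange steps of figure 2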

Page 13

Figure 1: Scaled number of exchange steps $\epsilon\tau$ to achieve accuracy $\epsilon$ for various $n$ and $\epsilon$ following a point disturbance. See table 1 and equation (20). Each line is scaled by $\epsilon$. All lines are initially increasing for small $n$ and asymptotically decreasing for larger $n$, demonstrating weak superlinear speedup.

Figure 2: Time course of disturbances for simulated CFD cases. Left: largest discrepancy among 512 processors partitioning an unstructured computational grid. The initial disturbance of 1,000,000 points confined to a single processor is reduced by 90% after 6 exchanges (20.625 μs) in exact agreement with theory ($\tau(0.1, 512)$ in table 1). Right: largest discrepancy among 1,000,000 processors rebalancing a disturbance following a bow shock adaptation. α = 0.1 and β = 3 in both cases. Wall clock times assume a 32 MHz J-machine. The interval between exchange steps is 3.4375 μs.

Page 14

                      n (total processors)
  τ(ε, n)      64     512    4,096    8,000      32K     256K     10^6
  ε = 0.1       7       6        6        5        5        5        5
  ε = 0.01    152     213      229      173      157      145      141
  ε = 0.001 2,749   5,763   10,031   10,139    9,082    7,564    7,003

Table 1: Solutions to equation (20) for increasing processor count $n$ and accuracy $\epsilon$. $\tau$ represents the number of exchange steps in the method. See figure 1.

$$= \sum_{i,j,k} c_{i,j,k}^2 \left[1 + \alpha\lambda_{i,j,k}\right]^{-\tau} \cos\!\left(\frac{2\pi x_0 i}{n^{1/3}}\right)\cos\!\left(\frac{2\pi y_0 j}{n^{1/3}}\right)\cos\!\left(\frac{2\pi z_0 k}{n^{1/3}}\right)$$    (18)

The appendix demonstrates that $c_{i,j,k} = (8/n)^{1/2}$ and thus the disturbance is a summation of equally weighted eigenvectors. From (17) and (18) the time dependent disturbance at $0,0,0$ is therefore

$$\tilde{u}[0,0,0](\tau\,dt) = \frac{8}{n} \sum_{i,j,k} \left[1 + 2\alpha\left(3 - \cos\frac{2\pi i}{n^{1/3}} - \cos\frac{2\pi j}{n^{1/3}} - \cos\frac{2\pi k}{n^{1/3}}\right)\right]^{-\tau}$$    (19)

Solving $\tilde{u}[0,0,0](\tau\,dt) \le \epsilon$ yields

$$\frac{8}{n} \sum_{i,j,k} \left[1 + 2\alpha\left(3 - \cos\frac{2\pi i}{n^{1/3}} - \cos\frac{2\pi j}{n^{1/3}} - \cos\frac{2\pi k}{n^{1/3}}\right)\right]^{-\tau} \le \epsilon$$    (20)

where $i, j, k$ are indexed from 0 to $n^{1/3}/2 - 1$ and the case $i = j = k = 0$ is omitted.

Table 1 and figure 1 present solutions of the inequality (20). These are exact predictions of the number $\tau$ of exchange steps which must occur to reduce a point disturbance by a factor $\epsilon$ on a periodic domain. Figures 2 and 3 are simulated results for two CFD cases. The first case of partitioning an unstructured grid is a point disturbance. In the simulation $\tau$ is observed to match the theoretical prediction exactly. The second case of rebalancing after a grid adaptation demonstrates the value of estimating $\tau$ from simulations. The initial disturbance is not a point and the simulation is observed to require 170 exchange steps before the worst case discrepancy drops to 10% of its original value.
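A direct way to evaluate (20) is to enumerate the eigenvalue indices and search for the smallest τ satisfying the inequality. The sketch below does this (an illustration, not the authors' code; the function name is mine and the diffusion parameter α is an assumed free input, so its output tracks Table 1 only for whatever α the authors used).

import numpy as np

def exchange_steps(eps, n, alpha):
    # Smallest tau satisfying inequality (20) for accuracy eps on n processors.
    N = round(n ** (1 / 3))
    idx = np.arange(N // 2)                  # i, j, k = 0 .. n^(1/3)/2 - 1
    i, j, k = np.meshgrid(idx, idx, idx, indexing="ij")
    lam = 2 * (3 - np.cos(2*np.pi*i/N) - np.cos(2*np.pi*j/N) - np.cos(2*np.pi*k/N))
    lam = lam.ravel()[1:]                    # omit the case i = j = k = 0
    tau = 1
    while (8 / n) * np.sum((1 + alpha * lam) ** (-tau)) > eps:
        tau += 1
    return tau

print(exchange_steps(0.1, 512, alpha=0.1))   # number of exchange steps predicted by (20)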

Page 15

Assume the initial disturbance $\tilde{u}(0)$ to be zero at every element except $[x,y,z]$. Then

$$\langle \vec{x}_{i,j,k}, \tilde{u}(0) \rangle = \vec{x}_{i,j,k}[x,y,z]$$    (13)

This is equal to the initial amplitude $a_{i,j,k}(0)$ of each eigenvector $\vec{x}_{i,j,k}$

$$\langle \vec{x}_{i,j,k}, \tilde{u}(0) \rangle = \left\langle \vec{x}_{i,j,k},\, \sum_{l,m,n} a_{l,m,n}(0)\,\vec{x}_{l,m,n} \right\rangle = \sum_{l,m,n} \langle \vec{x}_{i,j,k}, \vec{x}_{l,m,n} \rangle\, a_{l,m,n}(0) = \sum_{l,m,n} a_{l,m,n}(0)\,\delta_{il}\delta_{jm}\delta_{kn} = a_{i,j,k}(0)$$    (14)

The computational domain has periodic boundary conditions and as a result the origin of the coordinate system is arbitrary. Without loss of generality place the origin at the source of the disturbance and take $x = y = z = 0$. This has no effect on the eigenvectors $\vec{x}_{i,j,k}$ and from (13), (14)

$$a_{i,j,k}(0) = \vec{x}_{i,j,k}[0,0,0]$$    (15)

Placing the origin at the source of the disturbance is particularly convenient when we consider the first element of the eigenvectors $\vec{x}_{i,j,k}[0,0,0]$. $L$ has $n^{1/3}/2$ distinct eigenvalues $\lambda_{i,j,k}$, each of algebraic multiplicity two. Each of these eigenvalues has geometric multiplicity eight, i.e. has eight linearly independent associated eigenvectors of unit length

$$\vec{x}_{i,j,k}[x,y,z] = c_{i,j,k}\,F_1\!\left(\frac{2\pi xi}{n^{1/3}}\right) F_2\!\left(\frac{2\pi yj}{n^{1/3}}\right) F_3\!\left(\frac{2\pi zk}{n^{1/3}}\right)$$    (16)

where each $F_i$ is either sin or cos. By choosing $x = y = z = 0$ this expression (16) is zero except for the single eigenvector for which $F_1(x) = F_2(x) = F_3(x) = \cos(x)$. Without loss of generality restrict further consideration to initial disturbances of the form

$$u[0,0,0](0) = \sum_{i,j,k} c_{i,j,k}\,\vec{x}_{i,j,k}[0,0,0] = \sum_{i,j,k} c_{i,j,k}^2$$    (17)

From (9) define the time dependent disturbance at any location $x_0, y_0, z_0$

$$\tilde{u}[x_0,y_0,z_0](\tau\,dt) = \sum_{i,j,k} c_{i,j,k}\,[1 + \alpha\lambda_{i,j,k}]^{-\tau}\,\vec{x}_{i,j,k}[x_0,y_0,z_0]$$

Page 16

It is apparent from equation (9) that convergence of the individual components is dependent upon the eigenvalues $\lambda_{i,j,k}$. Reducing the amplitude of an arbitrary component $i,j,k$ by $\epsilon$ in $\tau$ steps of the method requires $[1 + \alpha\lambda_{i,j,k}]^{-\tau} \le \epsilon$. The worst case occurs for the smallest positive eigenvalue $\lambda_{0,0,1} = (2 - 2\cos(2\pi/n^{1/3}))$, which corresponds to a smooth sinusoidal disturbance with period equal to the length of the computational grid. To reduce such a disturbance requires

$$\tau = \left\lceil \frac{\ln \epsilon^{-1}}{\ln\left[1 + \alpha\left(2 - 2\cos\frac{2\pi}{n^{1/3}}\right)\right]} \right\rceil$$    (10)

Convergence of this slowest component approaches $\ln \epsilon^{-1} / \left[\alpha\left(2 - 2\cos\frac{2\pi}{n^{1/3}}\right)\right]$ steps for large $n$ since

$$\lim_{n\to\infty} \frac{\ln\left[1 + \alpha\left(2 - 2\cos\frac{2\pi}{n^{1/3}}\right)\right]}{\alpha\left(2 - 2\cos\frac{2\pi}{n^{1/3}}\right)} = 1$$

Convergence of the highest wavenumber component $\lambda_{(n^{1/3})/2-1,\,(n^{1/3})/2-1,\,(n^{1/3})/2-1}$ is rapid because

$$\tau = \left\lceil \frac{\ln \epsilon^{-1}}{\ln\left[1 + (6 - \epsilon)\alpha\right]} \right\rceil$$    (11)

ANALYSIS OF A POINT DISTURBANCE

This section considers a specific case of a point disturbance and derives an inequality relating $\tau$, $n$ and $\epsilon$. The purpose in doing this is to provide an exact prediction of convergence for a known case in order to demonstrate scaling properties. The procedure followed is to describe the initial amplitudes of the components of the disturbance and then solve an inequality which describes the magnitude of the disturbance over time. A periodic domain is assumed for the purpose of analysis. Simulations presented later in this paper verify that convergence is similar on aperiodic domains.

The following text uses the Poisson bracket $\langle \cdot, \cdot \rangle$ to represent the inner product operator. When discussing loads or eigenvectors it uses $\tilde{u}[x,y,z]$ or $\vec{x}_{i,j,k}[x,y,z]$ to denote the vector element which corresponds to location $x,y,z$ of the computational grid, with the convention that $[0,0,0]$ is the first element of the vector. Then the initial disturbance confined to a particular processor $x,y,z$ can be written as a superposition of eigenvectors of $L$

$$\tilde{u}(0) = \sum_{l,m,n} a_{l,m,n}(0)\,\vec{x}_{l,m,n}$$    (12)
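As an arithmetic illustration of (10) (added here, not from the paper; the function name and the value of α are assumptions), the number of exchange steps needed to damp the lowest frequency component grows with the mesh dimension:

import numpy as np

def tau_slowest(eps, n, alpha=0.1):
    # Exchange steps required to damp the lowest frequency component, equation (10).
    lam_min = 2 - 2 * np.cos(2 * np.pi / n ** (1 / 3))
    return int(np.ceil(np.log(1 / eps) / np.log(1 + alpha * lam_min)))

for n in [512, 8000, 10**6]:
    print(n, tau_slowest(0.1, n))   # the low frequency worst case persists longer as n grows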

Page 17

ELAPSED TIME FOR THE DIFFUSION

This section determines the number of artificial time steps $\tau$ required to reduce the load imbalance by a factor $\epsilon$. It does this by considering the eigenstructure of the finite difference equation (22), which is rearranged to express the change in load with each iteration

$$u_{x,y,z}(t+dt) - u_{x,y,z}(t) = \alpha\,[u_{x+1,y,z}(t+dt) + u_{x-1,y,z}(t+dt) + u_{x,y+1,z}(t+dt) + u_{x,y-1,z}(t+dt) + u_{x,y,z+1}(t+dt) + u_{x,y,z-1}(t+dt) - 6u_{x,y,z}(t+dt)]$$

or as a vector equation with matrix operator $L$

$$\tilde{u}(t+dt) - \tilde{u}(t) = \alpha L\,\tilde{u}(t+dt)$$    (6)

Any load distribution $\tilde{u}(t)$ can be written as a weighted superposition of eigenvectors $\vec{x}$ of $L$

$$\tilde{u}(t) = \sum_{i,j,k} a_{i,j,k}(t)\,\vec{x}_{i,j,k}$$

Using this fact rewrite the vector equation (6) as

$$\sum_{i,j,k} a_{i,j,k}(t+dt)\,\vec{x}_{i,j,k} - \sum_{i,j,k} a_{i,j,k}(t)\,\vec{x}_{i,j,k} = \alpha \sum_{i,j,k} L\, a_{i,j,k}(t+dt)\,\vec{x}_{i,j,k}$$    (7)

Using the definition of $L\vec{x}_{i,j,k}$ and the eigenvalues of $L$

$$L\vec{x}_{i,j,k} = -\lambda_{i,j,k}\,\vec{x}_{i,j,k}$$
$$\lambda_{i,j,k} = 2\left[3 - \cos\!\left(\frac{2\pi i}{n^{1/3}}\right) - \cos\!\left(\frac{2\pi j}{n^{1/3}}\right) - \cos\!\left(\frac{2\pi k}{n^{1/3}}\right)\right]$$    (8)

(7) can be further simplified to

$$\sum_{i,j,k} \left(a_{i,j,k}(t+dt)\,\vec{x}_{i,j,k}\,[1 + \alpha\lambda_{i,j,k}] - a_{i,j,k}(t)\,\vec{x}_{i,j,k}\right) = 0$$

and by the completeness and orthonormality of the eigenvectors

$$a_{i,j,k}(t+dt)\,[1 + \alpha\lambda_{i,j,k}] - a_{i,j,k}(t) = 0$$
$$a_{i,j,k}(dt) = \frac{a_{i,j,k}(0)}{1 + \alpha\lambda_{i,j,k}}$$    (9)
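The eigenvalue formula (8) can be checked numerically by building L explicitly for a small periodic mesh. The sketch below is an illustration added here, not part of the paper; it enumerates the full set of Fourier frequencies 0 .. n^(1/3)-1 so that multiplicities are covered, and the mesh size is an assumption.

import numpy as np

N = 4                                     # n = N**3 processors on a periodic mesh
n = N ** 3

def idx(x, y, z):
    return (x % N) * N * N + (y % N) * N + (z % N)

# L sums the six neighbors and subtracts six times the center, with periodic wraparound.
L = np.zeros((n, n))
for x in range(N):
    for y in range(N):
        for z in range(N):
            p = idx(x, y, z)
            L[p, p] = -6.0
            for dx, dy, dz in [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]:
                L[p, idx(x + dx, y + dy, z + dz)] += 1.0

numeric = np.sort(np.linalg.eigvalsh(L))

# Eigenvalues predicted by (8): L x = -lambda x with lambda = 2(3 - cos - cos - cos).
i, j, k = np.meshgrid(np.arange(N), np.arange(N), np.arange(N), indexing="ij")
lam = 2 * (3 - np.cos(2*np.pi*i/N) - np.cos(2*np.pi*j/N) - np.cos(2*np.pi*k/N))
predicted = np.sort(-lam.ravel())

print(np.max(np.abs(numeric - predicted)))    # agreement up to rounding error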

Page 18

nential rate. It demonstrates scalability by defining a lower bound on this rate as a function of $\alpha$, $n$. It applies the resulting theory to the case of a point disturbance and demonstrates that weakly superlinear speedup can be achieved under realistic assumptions. The appendix contains technical derivations in support of this analysis.

ACCURACY OF THE JACOBI ITERATION

Since the algorithm is intended to observe strict accuracy of $O(\epsilon)$ it is important to verify that each stage of the algorithm observes this accuracy. The justification of accuracy for the finite difference equation is given in the appendix. The accuracy of the coefficient matrix inversion can be verified by analyzing the spectral radius of the Jacobi iteration.

From the Geršgorin disc theorem [7] the eigenvalues $\lambda$ of (2) are bounded $|\lambda| \le \frac{6\alpha}{1+6\alpha}$. Since the row and column sums are constant and the iteration matrix is nonnegative ([7], theorem 8.1.22) the spectral radius equals the row sum

$$\rho\left(D^{-1}T\right) = \frac{6\alpha}{1+6\alpha}$$    (3)

Define the error in a current value $\tilde{u}^{(m)}$ under the iteration (2) as $e(\tilde{u}^{(m)}) = (\tilde{u}^{(m)} - \tilde{u}^{*})$ where $\tilde{u}^{*}$ is the fixed point of the Jacobi iteration. Then for $\beta > 0$

$$e(\tilde{u}^{(\beta)}) = e\left((D^{-1}T)^{\beta}\,\tilde{u}^{(0)}\right) = (D^{-1}T)^{\beta}\,e(\tilde{u}^{(0)})$$    (4)

which converges when $\rho(D^{-1}T) < 1$ since $\rho\left((D^{-1}T)^{\beta}\right) = \left(\rho(D^{-1}T)\right)^{\beta}$. In order for the algorithm to correctly reduce the error it is necessary to compute the desired load at each time step to an appropriate accuracy. In order to quantify the error define the infinity norm

$$\|e(\tilde{u}^{(m)})\|_{\infty} = \max_{x,y,z}\left|e\left(u^{(m)}_{x,y,z}\right)\right| = \max_{x,y,z}\left|u^{(m)}_{x,y,z} - u^{*}_{x,y,z}\right|$$

Using this norm define a necessary condition to improve the accuracy of the solution $\tilde{u}$ by a factor $\epsilon$ in $\beta$ steps to be $\|e(\tilde{u}^{(\beta)})\|_{\infty} \le \epsilon\,\|e(\tilde{u}^{(0)})\|_{\infty}$. From (4) this is satisfied when $\left(\rho(D^{-1}T)\right)^{\beta} \le \epsilon$ and thus (1)

$$\beta = \left\lceil \frac{\ln \epsilon}{\ln \rho(D^{-1}T)} \right\rceil = \left\lceil \frac{\ln \epsilon}{\ln \frac{6\alpha}{1+6\alpha}} \right\rceil$$    (5)

The method is unconditionally stable and the cost of this stability is small (in fact it is free for $0.833 \le \epsilon < 1$). This suggests the possible use of large time steps to deal with worst case disturbances.
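A small numerical illustration of (3)-(5) (added here, not part of the paper; the mesh size and α are assumptions): build A = D - T for a tiny periodic mesh, confirm that the Jacobi iteration matrix has spectral radius 6α/(1+6α), and watch the iterates approach the direct solution A⁻¹u.

import numpy as np

N, alpha = 4, 0.1                      # tiny periodic mesh, n = N**3 processors
n = N ** 3

def idx(x, y, z):
    return (x % N) * N * N + (y % N) * N + (z % N)

# A has (1 + 6*alpha) on the diagonal and -alpha on the six periodic neighbors.
A = np.zeros((n, n))
for x in range(N):
    for y in range(N):
        for z in range(N):
            p = idx(x, y, z)
            A[p, p] = 1 + 6 * alpha
            for dx, dy, dz in [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]:
                A[p, idx(x + dx, y + dy, z + dz)] -= alpha

D = np.diag(np.diag(A))
T = D - A                              # off-diagonal part, entries +alpha

rho = np.max(np.abs(np.linalg.eigvals(np.linalg.inv(D) @ T)))
print(rho, 6 * alpha / (1 + 6 * alpha))          # both 0.375 when alpha = 0.1

u_t = np.random.rand(n)                # load at time t
u = u_t.copy()                         # initial guess for u(t + dt)
for m in range(40):                    # Jacobi iteration (24); the error contracts by rho each sweep
    u = (T @ u + u_t) / (1 + 6 * alpha)
print(np.max(np.abs(u - np.linalg.solve(A, u_t))))   # small residual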

Page 19

$$\beta = \begin{cases} 2 & : \; 0 < \epsilon \le 0.0445 \\ 3 & : \; 0.0445 < \epsilon < 0.622 \\ 2 & : \; 0.622 \le \epsilon < 0.833 \\ 1 & : \; 0.833 \le \epsilon \end{cases}$$

3.2 Execution

At every processor $x,y,z$ adjust the workload $u_{x,y,z}$ as follows:

--------------------------------------------------------------------------
  $u^{(0)}_{x,y,z} = u_{x,y,z}$
  for m = 1 to $\beta$
      $u^{(m)}_{x,y,z} = \dfrac{u^{(0)}_{x,y,z}}{1 + 6\alpha} + \dfrac{\alpha}{1+6\alpha}\left(u^{(m-1)}_{x+1,y,z} + u^{(m-1)}_{x-1,y,z} + u^{(m-1)}_{x,y+1,z} + u^{(m-1)}_{x,y-1,z} + u^{(m-1)}_{x,y,z+1} + u^{(m-1)}_{x,y,z-1}\right)$    (2)
  endfor
  Exchange $\alpha\left(u^{(\beta)}_{x,y,z} - u'^{(\beta)}\right)$ units of work with every neighbor $u'$.
  $u_{x,y,z} = u^{(\beta)}_{x,y,z}$
--------------------------------------------------------------------------

Repeat these steps until reaching equilibrium. Much of the following analysis will be concerned with formulating an exact statement of the number of repetitions required to reach equilibrium. The motivation behind this analysis is to support claims of correctness, convergence, and accuracy. The analysis develops a theory of convergence for any disturbance and applies this to the specific case of a point disturbance. Analysis appears to be less practical in many cases than conservative estimates derived from simulations. Following the analysis this paper presents simulations from cases of interest in CFD and distributed operating systems. The paper concludes with a summary and discussion.

4 Reliability and Scalability

This section demonstrates reliability of the method by showing that for any initial disturbance every component of the disturbance vanishes at an expo-
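As an illustration of the execution step above (a sketch added here, not the authors' J-machine code; function name, mesh size, and parameter values are assumptions), the following simulates iteration (2) on a periodic three dimensional mesh stored as an array, with the exchange modeled simply by adopting the computed target loads.

import numpy as np

def execution_step(u, alpha, beta):
    # beta sweeps of iteration (2) on a periodic 3D mesh, followed by the exchange,
    # which in this array simulation amounts to adopting the computed target loads.
    u0, um = u.copy(), u.copy()
    for m in range(beta):
        nbrs = sum(np.roll(um, s, axis=a) for a in range(3) for s in (1, -1))
        um = (u0 + alpha * nbrs) / (1 + 6 * alpha)
    return um

# usage: repeat the step until the workload is balanced to the desired accuracy
u = np.random.rand(8, 8, 8) * 1000.0
for step in range(50):
    u = execution_step(u, alpha=0.1, beta=3)
print(u.std() / u.mean())              # the relative imbalance decays toward zero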

Page 20

properties of the parabolic heat equation $u_t - \sigma\nabla^2 u = 0$. The heat equation describes a process in nature whereby thermal energy diffuses away from hot regions and into cold regions in a volume until the entire volume is of the same temperature. A literal interpretation of the terms in the heat equation reads that the rate of change in temperature $u$ at each point in the volume is determined by the local curvature $\nabla^2 u$ times a diffusion rate $\sigma$. This paper applies finite difference techniques to derive an unconditionally stable discrete form of this equation, and uses a scalable iterative method to invert the resulting coefficient matrix. The end result is a concurrent load balancing method which is scalable, reliable and efficient.

3 A Parabolic Algorithm

The algorithm consists of a simple arithmetic iteration which is performed concurrently by every processor in the multicomputer. Each step of the iteration requires 7 floating point operations at each processor. The iteration calculates an expected workload at each processor. Processors periodically exchange units of work with their immediate neighbors in order to make their actual workload equal to this expected workload. The algorithm contains parameters which control the accuracy of the resulting solution. In many applications an accuracy of 10% is sufficient. In these cases only 24 iterations are required to reduce a point disturbance by 90% regardless of the size of the multicomputer. This paper presents the algorithm for a three dimensional mesh. The reduction to two dimensions is described in the discussion section.

3.1 Initialization

Specify the accuracy $\epsilon$ desired in the resulting equilibrium. For example, to balance to within 10% choose $\epsilon = 0.1$. Determine the interval $\beta$ at which processors will exchange work with their immediate neighbors according to the formula

$$\beta = \left\lceil \frac{\ln \epsilon}{\ln \frac{6\alpha}{1+6\alpha}} \right\rceil$$    (1)

Note that in the interval $0 < \epsilon < 1$, $\beta$ is less than or equal to 3:

Page 21

multicomputers should be scalable but should not sacrifice reliability.

It is worth noting that a class of random placement methods has been proposed for scalable multicomputers [2, 10]. These methods are scalable and are reliable under the assumption that disturbances occur frequently and have short lifespans. These assumptions do not hold in a domain like CFD where disturbances arise occasionally and are long lasting. As a result we are unable to take advantage of the rewards these methods offer.

A method is reliable if it can be shown to compute a correct solution within a predictable number of steps. The simplest reliable load balancing method collects load statistics from all processors, computes the average load, and broadcasts the average to all processors. Each processor then exchanges work with its neighbors so that the new workloads equal this average. Unfortunately this simplest method is not scalable because it is inherently serial. The number of terms in the calculation of the average load is proportional to the number of processors in the computer system. Although this cost can be made logarithmic with an octree data structure there is a more severe cost associated with interprocessor communication. The current state of the art in mesh routing technology requires a nonconflicting communication path for each message. The opportunities for path conflicts known as "blocking events" increase factorially with the number of processors in the computer system. This simplest reliable method is not scalable because the time lost to blocking events can grow factorially with the size of the computer system.

A method is scalable if it can run efficiently on computer systems with very large numbers of processors. Due to the effects of Amdahl's law [1] most scalable methods are concurrent algorithms in which the computation is distributed among the processors in the computer system. What these methods gain in scalability they often lose in reliability because they lack formal proofs of correctness and convergence.

As one example of the problems which can arise in concurrent algorithms consider a simple concurrent method in which each processor adjusts its load to equal the average of the loads at its immediate neighbors. This method is distributed and scalable and is easily seen to be convergent. Unfortunately it is well known that it converges to solutions of the Laplace equation $\nabla^2\phi = 0$. This equation is known to admit sinusoidal solutions which are not equilibria. As a result this method, although scalable, is not reliable.

This paper leverages existing numerical and analytic techniques to derive a reliable and scalable load balancing method. The method is based on

Page 22

time) balancing.

The method assumes the computation is sufficiently fine grained that work can be treated as a continuous quantity. This is reasonable in an application domain like computational fluid dynamics in which units of work represent grid points in a simulation. Each processor is typically responsible for thousands of grid points and can exchange individual points with other processors.

Cybenko [6] has published a method which is similar in several respects to this one. He proves asymptotic convergence of an iterative scheme on arbitrary interconnection topologies. Two other articles have appeared more recently on diffusive load balancing methods. Boillat [4] demonstrates polynomial convergence on arbitrary interconnection topologies using a Markov analysis. Horton objects to the polynomial convergence demonstrated in [4] and the lack of bounds on accuracy in the previous work. He applies a multigrid concept [11, 14] to accelerate convergence of a simple diffusive scheme.

The method presented in this paper, while developed independently, resembles a special case of Cybenko's method restricted to mesh connected topologies. It differs from Cybenko's method with regard to issues of numerical stability. The objections in [11] lack rigor. This paper refutes them via formal demonstrations of convergence rates, accuracy, and scaling. These results show that convergence is rapid, accuracy is limited only by machine precision, and superlinear speedup can be achieved for cases of practical interest in CFD.

2 A Parabolic Model

A number of articles have proposed solutions to the load balancing problem in recent years [2, 5, 10, 12, 13, 15, 17, 18, 22]. Many of these solutions are reliable and efficient for computer systems with small numbers of processors. Unfortunately many of them do not scale well to systems with large numbers of processors. It is well known that in a scalable algorithm the amount of work performed in parallel grows more rapidly than the amount of work performed in the serial part of the calculation as the size of the problem increases [1, 9]. Scalable algorithms tend to be highly concurrent and this fact often makes it difficult to prove that they are correct. A load balancing method for scalable

Page 23

1 Introduction

The scale of scientific applications and the computers on which they run is growing rapidly. Disciplines such as particle physics and computational fluid dynamics pose numerous problems which until recently have exceeded the memory and cpu capacities of existing computers. Moreover these disciplines appear ready to pose new problems which exceed the capacities of the largest computers that will be built in this decade. The Grand Challenges represent a set of problems solvable by computers with TeraFlops ($10^{12}$) performance. If present trends continue these will be supplanted within a decade by a set of challenges requiring PetaFlops ($10^{15}$) performance.

This increase in problem size is made possible by the dramatic reductions in size and cost of VLSI technology and parallel computing. Currently our research in computational fluid dynamics uses scalable multicomputers with roughly 500 processors [19, 23, 24]. Research efforts are underway to develop within three years' time scalable multicomputers with hundreds of thousands of functional units [16, 21]. These trends suggest that limits to the growth of scientific computing are more likely to be determined in the next several years by software technology than by hardware limitations.

A potentially limiting software technology is the class of methods to solve the load balancing problem. Most numerical algorithms require frequent synchronization. If a load distribution on a multicomputer is uneven then some processors will sit idle while they wait for others to reach common synchronization points. The amount of potential work lost to idle time is proportional to the degree of imbalance that exists among the processor workloads. Since this loss also increases with processor count it can be valuable to control the accuracy of the resulting balance and to trade off the quality of the balance against the cost of rebalancing.

This paper presents a load balancing method for mesh connected scalable multicomputers with any number of processors. The method can balance the load to an arbitrary degree of accuracy. Theory and experiment show the method is inexpensive under realistic conditions. While it is effective on contemporary systems with under 1,000 processors it is specifically intended to scale to systems with tens and hundreds of thousands of processors. The total wall clock time for the method decreases as the processor count increases. The method was developed to solve the dynamic (run time) load balancing problem. It has also proven useful for cases of static (initial load

Page 24

A Parabolic Load Balancing Method¹

Alan Heirich & Stephen Taylor
Scalable Concurrent Programming Laboratory
California Institute of Technology

Abstract

This paper presents a diffusive load balancing method for scalable multicomputers. In contrast to other schemes which are provably correct the method scales to large numbers of processors with no increase in run time. In contrast to other schemes which are scalable the method is provably correct and the paper analyzes the rate of convergence. To control aggregate cpu idle time it can be useful to balance the load to specifiable accuracy. The method achieves arbitrary accuracy by proper consideration of numerical error and stability.

This paper presents the method, proves correctness, convergence and scalability, and simulates applications to generic problems in computational fluid dynamics (CFD). The applications reveal some useful properties. The method can preserve adjacency relationships among elements of an adapting computational domain. This makes it useful for partitioning unstructured computational grids in concurrent computations. The method can execute asynchronously to balance a subportion of a domain without affecting the rest of the domain.

Theory and experiment show the method is efficient on the scalable multicomputers of the present and coming years. The number of floating point operations required per processor to reduce a point disturbance by 90% is 168 on a system of 512 computers and 105 on a system of 1,000,000 computers. On a typical contemporary multicomputer [19] this requires 82.5 μs of wall-clock time.

¹The research described in this report is sponsored primarily by the Advanced Research Projects Agency, ARPA Order number 8176, and monitored by the Office of Naval Research under contract number N00014-91-J-1986.