Dynamic Load Balancing in Parallel and Distributed Networks
by Random Matchings
(Extended Abstract)
Bhaskar Ghosh*        S. Muthukrishnan†

*Department of Computer Science, Yale University, P.O. Box 208285, New Haven, CT 06520. Internet: [email protected]. Research supported by ONR under grant number 491-J-1576 and a Yale/IBM joint study.

†Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY 10012-1185, USA; [email protected], (212) 998-3061. The research of this author was supported in part by NSF/DARPA under grant number CCR-89-06949 and by NSF under grant number CCR-91-03953.
Abstract
The fundamental problems in dynamic load balancing
and job scheduling in parallel and distributed comput-
ers involve moving load between processors. In this
paper, we consider a new model for load movement
in synchronous parallel and distributed machines. In
each step of our model, each processor can transfer
load to at most one neighbor; also, any amount of
load can be moved along a communication link be-
tween two processors in one step. This is a reason-
able model for load movement in significant classes of
dynamic load balancing problems.
We derive efficient algorithms for a number of task
reallocation problems under our model of load move-
ment. These include dynamic load balancing on processor networks, adaptive mesh re-partitioning such as
those in finite element methods, and progressive job
migration under dynamic generation and consump-
tion of load.
To obtain the above-mentioned results, we introduce and solve the abstract problem of Incremental Weight Migration (IWM) on arbitrary graphs. Our main result is a simple, randomized algorithm for this
problem which provably results in asymptotically op-
timal convergence towards the state where weights on
the nodes of the graph are all equal. This algorithm
utilizes an appropriate random set of edges forming
a matching. Our algorithm for the IWM problem is
used in deriving efficient algorithms for all the prob-
lems mentioned above.
Our results are very general. The algorithms we derive are local, and hence, scalable. They work for arbitrary load distributions and for networks of arbitrary topology which can possibly undergo link failures. Of independent interest is our proof technique, which we use to lower bound the convergence of our algorithms in terms of the eigenstructure of the underlying graph.
Finally, we present preliminary experimental results analyzing issues in load balancing related to our algorithms.
1 Introduction
Consider the following scenario of dynamic load balancing in a distributed setting. An application program is running on a distributed network of arbitrary topology comprising a large number of processors. Each processor has a load of independent tasks to be executed. The distribution of tasks is dynamically determined, that is, the specific application program running on the machine cannot be developed with a priori estimates of the load distribution. The task of dynamic load balancing is to reallocate the tasks so that each processor has nearly the same amount of load. Of course, in natural settings, the scenario is substantially more demanding in that the tasks might be dynamically generated or consumed in each step and additionally, the underlying topology might change owing to failures in communication links. Besides load balancing, scenarios such as the one above occur in several other guises, for example, in job scheduling, adaptive mesh partitioning and resource allocation problems. (These guises will be more clearly explained later with examples.) In each of these guises, equitable load redistribution is critical for efficient implementation of algorithms on both distributed and parallel computers.
Standard models for dynamic load balancing make
the following assumptions. (See for example [AA+93,
LM93, R91].) In one time step, each processor can
migrate load to any (possibly all) of the other processors (possibly including non-neighbors). Also, at most one unit of load can be transferred across any link in a step. Under this model, dynamic load balancing has been extensively studied. The main focus of this paper concerns the amount of parallelism in this standard model. Specifically,
1. The standard model overestimates available parallelism in communication with neighbors. In practice, in each time step, each processor can communicate with only one other processor. That is, the communication with a set of neighboring processors is inherently sequential.
2. The standard model underestimates available parallelism in edge capacity. With increasingly high-bandwidth networks becoming available, a large amount of data can be transferred across a link in one time step. Therefore, it is reasonable to assume that several units of tasks can be migrated in the same message across a link in one step, provided moving each task incurs movement of only a reasonable amount of data. Indeed, there are large classes of important dynamic load balancing problems in which the tasks have small associated data space (examples are provided in Section 1.3).
Motivated by these observations, we study dy-
namic load balancing (and other guises in which it comes up) in distributed networks with unbounded edge capacity under the restriction that each processor can migrate load to at most one of its neighbors in one step. Our approach is to identify an abstract problem which we call the Incremental Weight Migration Problem on arbitrary graphs. This can be thought of as a single step of load migration across the entire network in parallel on our model. Our main result is an asymptotically optimal algorithm for this problem on our model. We utilize this in deriving efficient algorithms for many other problems, including dynamic load balancing, adaptive mesh partitioning,
and dynamic job scheduling.
Our algorithms employ only local control and data communication; hence, they are scalable. Also, our results are very general, in that they hold for networks of arbitrary topology which can possibly undergo link failures during the execution of the algorithm.
1.1 Problem and Model
Incremental Weight Migration (IWM) Prob-
lem. Consider an undirected connected agent graph
$G = (V, E)$ with $n$ vertices in which each vertex $v_i$, representing an agent $A_i$, has weight $w_i$. The potential $\Phi$ of the graph is defined to be $(\sum_i w_i^2) - n\bar{w}^2$, where $n$ is the number of nodes in $G$ and $\bar{w} = \sum_i w_i/n$ is the average weight.¹ The IWM problem is to determine a set of matching edges $M$ (that is, no pair of edges share an endpoint) and to specify, for each edge in $M$, a relocation of weights on its ends across the edge such that the drop in the potential function is the maximum.
Model. We note three characteristics of the model in the definition of the problem above. First, weight movements are local, i.e., any portion of the weight on a node which is moved ends up at a neighbor of that node. Second, any portion of the weight on an endpoint of an edge can be moved across the edge in one step. Third, each agent is involved in weight transfer with at most one of its neighbors.
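To make the objective concrete, the following minimal Python sketch (illustrative only; the function names and the toy weights are ours, not from the paper) computes the potential $\Phi$ and the drop obtained by equalizing the weights across a single edge, which is the quantity the IWM problem seeks to maximize over a matching of edges.

```python
def potential(w):
    """Phi = sum_i w_i^2 - n * wbar^2: squared distance to the balanced vector."""
    n = len(w)
    wbar = sum(w) / n
    return sum(x * x for x in w) - n * wbar * wbar

def drop_from_equalizing(w, i, j):
    """Potential drop if agents i and j average their (real-valued) weights:
    (w_i^2 + w_j^2) - 2*((w_i + w_j)/2)^2 = (w_i - w_j)^2 / 2."""
    return (w[i] - w[j]) ** 2 / 2

# Hypothetical toy weights: equalizing across the most unbalanced pair.
w = [9.0, 1.0, 4.0, 6.0]
print(potential(w))                   # 34.0
print(drop_from_equalizing(w, 0, 1))  # 32.0
```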
Desirable Algorithmic Features. Naturally we would like an algorithm of reasonable computational expense. This is particularly relevant since traditionally, load balancing and scheduling algorithms trade off performance for running time. Importantly, we would like our algorithm to be fully local and distributed. This is critical because algorithms which need and rely on global information are expensive; additionally they may not work when links in the underlying network graph fail, as may happen in practice for distributed networks.
1.2 Our Main Results
Our performance measure is the ratio of the drop in potential to the original potential; we denote this as
the convergence factor. We assume that the agents
(processors) work in lock-step.
A. Real Weight Case. First, consider the case when the weights are real. This implies that the weight on each agent can be subdivided to arbitrary precision. In this case, we design a simple, completely local, randomized algorithm for the IWM problem which has an expected convergence factor of at least $c\,\frac{\lambda_2}{d}$. Here, $\lambda_2$ is the second smallest eigenvalue of the Laplacian of the agent graph $G$, $d$ is the degree of $G$, and $c$ is a constant ($0 < c \leq 1$), independent of $G$.
It is easy to show that there exists an agent graph and an assignment of weights to its vertices such that no algorithm (even one which is randomized and which has global information) can have a convergence factor greater than $c_2\,\frac{\lambda_2}{d}$, where $c_2$ is a constant.
Therefore, our algorithm is asymptotically optimal.
¹Our potential function $\Phi$ is the square of the Euclidean distance between the weight vector $\vec{w} = (w_1, \ldots, w_n)^T$ and the load-balanced vector $(\bar{w}, \bar{w}, \ldots, \bar{w})^T$. Note that $\Phi \geq 0$. The potential becomes zero only when the weights on the agents are all equal to $\bar{w}$.
B. Discrete Weight Case. In real applications, the weights are discrete, i.e., each $w_i$ is a collection of $w_i$ unit tasks and a unit task cannot be divided further. We show that a simple modification of the algorithm in Case A also works when the weights are discrete. Its expected convergence factor is at least $c\,\frac{\lambda_2}{d}$, where $c$ is a constant ($0 < c \leq 1$), provided the initial potential is $\Omega(n^4)$.² This algorithm has an optimal convergence factor as well.
The discrete case is intrinsically harder than the real weight case when the potential is small ($o(n^4)$) since we can show the following: there exists an agent graph and assignment of weights to the agent nodes such that no algorithm can have a convergence factor of more than $\epsilon\,\frac{\lambda_2}{2d}$, for any constant $\epsilon$. In contrast, the algorithm in Case A has a convergence factor of $\Omega(\lambda_2/2d)$.³ On the other hand, when $\Phi = o(n^4)$, we can show that our algorithm reduces $\Phi$ by an additive term rather than by a multiplicative factor.
Remark 1. It is worth noting that although weights
are moved only along a subset (matching) of edges,
the convergence bounds are in terms of the global
properties of the graph, namely, $\lambda_2$ and $d$. Note that for any connected graph, $0 < \frac{\lambda_2}{2d} \leq 1$. Thus, our algorithms guarantee a positive fractional (possibly non-constant) decrease in the potential for any connected graph.
Remark 2. The parameter $\lambda_2$ reflects the connectivity of the underlying graph. For a line graph, a $d$-dimensional mesh, a hypercube, a $d$-regular expander and a clique on $n$ nodes, the fraction $\lambda_2/d$ roughly equals $1/n^2$, $1/(d\,n^{2/d})$, $1/\log n$, $\Theta(1)$, and $1$, respectively.
Remark 3. Our algorithms are extremely efficient since each agent takes only $O(d)$ time for control. Also, each agent performs data transfer across at most one edge in each time step.
Remark 4. The performance of our algorithm is guaranteed even if some edges disappear between successive time steps; in this case, $d$ represents the degree of the graph at the beginning of the algorithm and $\lambda_2$ represents the second smallest eigenvalue of the Laplacian of the graph that remains at the end of the algorithm. The proof of this claim is omitted in
this paper.
²In fact, we prove this convergence factor when the initial potential is $\Omega(dn/\lambda_2)$. Since $\lambda_2 = \Omega(1/n^2)$ and $d < n$ for any connected graph, it follows that the claimed convergence factor holds when the initial potential is $\Omega(n^4)$.
³Throughout this paper, we use $\Omega(f)$, for a given function $f$, to denote $cf$ for some constant $c$.
1.3 Applications and Our Other Results
We utilize our algorithm for the IWM problem to de-rive efficient algorithms for a variety of task realloca-
tion problems. In what follows, we briefly introduce three major applications, and defer the other applications to our detailed paper.
1. We provide an efficient and first-known completely analyzable algorithm in our model for dynamic load balancing on arbitrary networks under possible link failure.

2. We provide the first-known analytical convergence result for abstract dynamic re-partitioning problems (e.g., re-partitioning an adaptively changing mesh such as those used in finite element methods).

3. We initiate a new paradigm of progressive task scheduling, in which even under dynamic generation and consumption of load, at each task scheduling step, a fractional progress is made towards the load-balanced state.
Dynamic Load Balancing. Given a processor net-
work with arbitrary discrete weights on the vertices,
the dynamic load balancing problem is to move the loads so as to have nearly the same amount of load on each processor. Dynamic load balancing has been studied in a number of settings. Almost all research has focused on algorithms for specific topologies and/or relies on global routing phases. A class of such research has involved performance analysis of load balancing algorithms by simulations [LMR91]. Among analytical results, load balancing for specific topologies under statistical assumptions on input load distributions has been studied [HCT89]. For arbitrary initial load distribution, load balancing has been studied in special topologies such as Counting Networks [AHS91, HLS92], Hypercubes [P89], Meshes [HT93] and Expanders [PU89]. These algorithms do not extend to arbitrary or dynamically changing topologies. For dynamically changing topologies, load balancing has been studied under assumptions on the pattern of failures for specific architectures [R89, AB92].

For arbitrary topologies, under the assumption that one load unit can be migrated across each edge in parallel and that each processor can communicate with all its neighbors in one step, [AA+93] presents an algorithm for dynamic load balancing which takes $O(\Delta \log(n\Delta)/p)$ steps to approximately balance the loads. The approximation is within an additive term of $d \times \mathrm{diameter}(G)$. Here $p$ is the vertex expansion of the graph and $\Delta = \max_i (w_i - \bar{w})$, where $\bar{w} = \sum_i w_i/n$. Their algorithm is optimal up to a $\log(n\Delta)$ factor.
In our model of dynamic load balancing, load can be moved along only one edge from a processor in a time step, and there is no restriction on the amount of load that can be moved. This is an appropriate model when each task has small associated data space; therefore several tasks can be communicated in one time step. This is true for a large class of problems like fine grain programs which spawn processes dynamically [GH89, K88], real time data fusion problems [CA87, FG91] and game tree searches [F93].
Clearly, the dynamic load balancing problem for discrete weights can be solved by applying our algorithm for the IWM problem repeatedly. This algorithm balances load approximately in $O((d/\lambda_2)(\log \Phi_0 + dn))$ invocations of our algorithm for the IWM problem. The load balancing is approximate in the sense that our algorithm stops when, for each edge $(i, j)$, $|w_i - w_j| \leq 1$. Our algorithm works for arbitrary topologies under possible failure of links
connecting the processors.
We remark that Cybenko [C89] considered a stronger version of our model by additionally allowing each processor to transfer load to all its neighbors in one time step. This work is of mathematical interest since it considers only the case when the weights are real.
Problem Re-Partitioning. Abstractly, assume
each node in the agent graph corresponds to a par-
tition or sub-domain of a global data domain. Each
node in the agent graph is mapped to a processor. Due to dynamic computations at each processor, the sub-domains get refined, leading to a load-imbalance in the size of each sub-domain. Repartitioning of the domain becomes necessary to achieve load balance.
Such applications come up in various forms in the use of adaptive finite-element and finite-difference methods using either locally adaptive meshes or order of approximation, for example in h-p finite element methods which are common in mechanical engineering and visualization software. In adaptive-mesh terminology, the agent graph representing sub-domain connectivity information is called the quotient graph. Achieving balanced sub-domains usually involves shifting the boundaries of adjoining sub-domains (i.e., across edges in the quotient graph) so as to equalize the data points in each sub-domain. Further references on these areas can be found in [BB87, HT93, W91].
Clearly, our algorithm for the IWM problem can be used repeatedly on the quotient graph to solve the problem of mesh partitioning. Note however that the actual migration of data points as determined by the application of our algorithm can be performed on the underlying architecture by either local communication (if adjoining sub-domains have been mapped to adjoining processors) or by non-local routing (if adjoining sub-domains have been mapped to non-adjoining processors).
Progressive Dynamic Task Scheduling. Con-sider a segment of a distributed execution, where tasks
are generated and consumed in each step in an unpre-dictable manner at various nodes as the computation
proceeds. We are required to schedule the tasks in
each step by moving them to underloaded or idle pro-
cessors so as to increase the throughput. This scenario arises in general purpose distributed computing [LK87, NX+85] as well as in specific applications such as the parallel branch-and-bound search on game trees [KZ88] and dynamic tree embedding on distributed or parallel architectures [LN+89, R91].
We initiate a new paradigm for these problems. For motivation, note that there are broadly two paradigms for task scheduling in this scenario. In one paradigm, the scheduling guarantees that each processor has at least one task to execute at the end of the step. In the other paradigm, the scheduling guarantees that all processors have roughly the same number of tasks at the end of the step. It is easy to see that in both these paradigms, there exists a sequence of load generation and consumption that forces any algorithm to resort to load movement between two non-neighboring processors to satisfy the guarantee on load distribution. Load movement between processors which are not neighbors is an expensive operation. We advocate the approach of restricting algorithms to only perform load movements between neighbors, but requiring a guarantee of a reasonable progress towards the load-balanced state. In our case, the reasonable progress is a decrease in the distance to the load-balanced state (formalized as a potential function) by a multiplicative factor. We refer to this as progressive dynamic task scheduling.
Using our algorithm for the IWM problem once each step in this case of dynamic load migration with loads generated and consumed each step, we can guarantee that the potential drops by at least a $\lambda_2/16d$ factor in the expected case in each step as long as the potential is large. Note that our algorithm is the first known algorithm to make such a guarantee. In a related work [AA+93] (under the weaker assumption that tasks are not dynamically generated or consumed), a distance function (different from ours) is used to measure the progress towards the load-balanced state in several steps. However, they cannot guarantee a fractional decrease in the distance in every step since their argument involves amortization of the decrease in the distance over several steps.
1.4 Our Techniques
For intuition, consider solving the IWM problem with real weights. Given a graph which is not balanced,
we can always pick an edge (i, j) and equalize the
weights across its endpoints. Note that this provably decreases $\Phi$, since the reduction in $\Phi$ is $(w_i^2 + w_j^2) - 2\left(\frac{w_i + w_j}{2}\right)^2 = \frac{(w_i - w_j)^2}{2} \geq 0$. We speed up this process by balancing along a matching set of edges in parallel. Note that a set of matching edges can be obtained in several ways. For example, edge-coloring the input graph gives us a set of matchings where each color defines a matching. Alternately, given graph $G$, we can explicitly compute the matching which gives the maximum potential drop. All these schemes require expensive computation of global information; also, they may not work when some edges disappear.
In our algorithm, we choose a random set of matching edges locally. The manner in which the random matching is chosen ensures that there is a global lower bound on the probability of each edge appearing in the matching. This property ensures global convergence bounds. For choosing such a random matching, we draw upon the intuition from the very sparse phase in the evolution of random graphs [B87].
There appears to be some connection between our techniques for analyzing our algorithm and those used in analyzing rapidly mixing properties of Markov Chains [M89, AA+93]. However, no formal connection is known to us.
1.5 Organization of the Paper
The IWM problem for real weights is solved in Section 3. We extend this solution in Section 4 to the case when the weights are discrete. In Section 5 we demonstrate one of the applications, namely, dynamic load balancing. The rest of the applications are omitted in this paper. In Section 6 we present some preliminary experimental results from the implementations of our algorithms.
2 Preliminaries
Consider an undirected connected agent graph $G = (V, E)$ with $n$ vertices and maximum degree $d$. Each vertex $v_i$ represents an agent $A_i$ and has weight $w_i$. We denote the distribution of weights $w_i$ on node $i$ of $G$ by the weight vector $\vec{w}$. The potential $\Phi$ of the graph is defined to be $(\sum_i w_i^2) - n\bar{w}^2$, where $n$ is the number of nodes in $G$ and $\bar{w} = \sum_i w_i/n$ is the average weight. Given $G$ and $\vec{w}$ and any algorithm for IWM, let the potentials before and after the invocation of the algorithm be $\Phi$ and $\Phi'$. Then the convergence factor for this algorithm is defined to be $(\Phi - \Phi')/\Phi$.
Let $A$ denote the adjacency matrix of $G$. Define a matrix $D = [d_{i,j}]$, where $d_{i,j} = 0$ if $i \neq j$, and $d_{i,i}$ is the degree of agent $i$. The matrix $L = D - A$ is the Laplacian matrix of $G$. The eigenvalues of $L$ are $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$.
Fact 1. $G$ is a connected graph if and only if $\lambda_2 > 0$. It can be shown that for any connected graph with $n$ vertices, $\lambda_2 = \Omega(1/n^2)$.
Fact 2. From the Courant-Fischer Minimax Theorem, it follows that [MP92]
$$\lambda_2 = \min_{x \perp v_1,\; x \neq 0} \frac{x^T L x}{x^T x} = \min_{x \perp v_1,\; x \neq 0} \frac{\sum_{(i,j) \in E} (x_i - x_j)^2}{\sum_{i \in V} x_i^2},$$
where $v_1 = (1, 1, \ldots, 1)^T$ is the eigenvector corresponding to $\lambda_1 = 0$ and $x \perp v_1$ means that the vector $x$ is orthogonal to $v_1$.
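As an illustration of these definitions, the short numpy sketch below (ours, not from the paper; the graph is a toy example) builds the Laplacian $L = D - A$ of a small graph, reads off $\lambda_2$ as the second smallest eigenvalue, and evaluates the quantity $\lambda_2/16d$ that appears as the convergence-factor lower bound in Theorem 1 of the next section.

```python
import numpy as np

def laplacian(n, edges):
    """L = D - A for an undirected graph on vertices 0..n-1."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    return np.diag(A.sum(axis=1)) - A

# Hypothetical example: a 4-cycle.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
L = laplacian(4, edges)
eigenvalues = np.sort(np.linalg.eigvalsh(L))  # 0 = lambda_1 <= lambda_2 <= ... <= lambda_n
lambda2 = eigenvalues[1]
d = int(L.diagonal().max())                   # maximum degree
print(lambda2, lambda2 / (16 * d))            # 2.0 and 0.0625 for the 4-cycle
```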
3 Algorithm for IWM (Real Weights)
In this section we present a local randomized algorithm for the IWM problem with real weights. Given a graph $G$ and weight vector $\vec{w}$, Algorithm Local-Random (LR) works as follows:
1. Pick a random matching $M$ in $G$ as follows:

   a. Each edge $e$ is independently put in $M$ with probability $1/p$ ($p$ will be fixed later).

   b. Each edge $(u, v)$ removes itself from $M$ if $(w, u)$ or $(w, v)$ is in $M$ for some $w \in V$.

2. For each edge $(i, j) \in M$ (assuming without loss of generality $w_i \geq w_j$), move $(w_i - w_j)/2$ load units from agent $i$ to agent $j$.
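A minimal sequential simulation of one invocation of Algorithm LR is sketched below in Python (an illustration of ours; the function and variable names are not from the paper). It carries out Steps 1a, 1b and 2 exactly as stated, with $p = 4d$ as fixed in the proof of Lemma 1 below; a truly distributed implementation would let each agent make these choices using only local information, as Remark 1 at the end of this section notes.

```python
import random
from collections import defaultdict

def lr_step(weights, edges, d, rng=random):
    """One invocation of Algorithm LR on real-valued weights.
    weights: dict node -> weight; edges: list of (u, v) pairs; d: maximum degree."""
    p = 4 * d
    # Step 1a: each edge enters the candidate set independently with probability 1/p.
    candidates = [e for e in edges if rng.random() < 1.0 / p]
    # Step 1b: an edge removes itself if another candidate edge shares one of its endpoints.
    touches = defaultdict(int)
    for u, v in candidates:
        touches[u] += 1
        touches[v] += 1
    matching = [(u, v) for u, v in candidates if touches[u] == 1 and touches[v] == 1]
    # Step 2: equalize the weights across every surviving (matched) edge.
    for u, v in matching:
        avg = (weights[u] + weights[v]) / 2.0
        weights[u] = weights[v] = avg
    return matching
```

By Lemma 1, every edge survives into the matching with probability at least $1/8d$, which is what drives the convergence bound of Theorem 1.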
Lemma 1 For each edge $e = (u, v)$, Pr(Algorithm LR picks $e$ in $M$) $\geq 1/8d$.

Proof. Fix an edge $e = (u, v)$. Now,
$$\Pr(e \text{ is in } M \text{ after Step 1a and removed in Step 1b}) \leq \frac{1}{p} \cdot \frac{2(d-1)}{p}.$$
Therefore,
$$\Pr(e \text{ in } M \text{ after Step 1b}) = \Pr(e \text{ in } M \text{ after Step 1a}) - \Pr(e \text{ in } M \text{ after Step 1a and removed in Step 1b}) \geq \frac{1}{p} - \frac{2(d-1)}{p^2} \geq \frac{p - 2d + 2}{p^2}.$$
Now set $p = 4d$. Then, $\Pr(e \text{ in } M \text{ after Step 1b}) \geq \frac{2d + 2}{16d^2} \geq \frac{1}{8d}$. ∎
Theorem 1 For any connected graph $G$ and weight vector $\vec{w}$, the expected value of the convergence factor $C_{LR}$ of Algorithm LR is at least $\frac{\lambda_2}{16d}$.
Proof: Let $\Delta\Phi$ be the drop in total potential of $G$ due to Algorithm LR. For each edge $(i, j)$ in $E$, let $\Delta_{i,j}$ be the drop in potential due to weight equalization between agents $i$ and $j$ if $(i, j)$ is in the matching picked by LR. Clearly, $\Delta_{i,j} = w_i^2 + w_j^2 - 2\left(\frac{w_i + w_j}{2}\right)^2 = (w_i - w_j)^2/2$. Hence,
$$E(\Delta\Phi) = \sum_{(i,j) \in E} \Pr(\text{edge } (i,j) \text{ in } M) \cdot \Delta_{i,j} = \sum_{(i,j) \in E} \Pr(\text{edge } (i,j) \text{ in } M) \cdot \frac{(w_i - w_j)^2}{2} \geq \frac{\sum_{(i,j) \in E} (w_i - w_j)^2}{16d} \quad \text{(by Lemma 1)}.$$
Recall that the convergence factor is $\Delta\Phi/\Phi$. Note that $\Phi = (\sum_{i \in V} w_i^2) - n\bar{w}^2 = \sum_{i \in V} (w_i - \bar{w})^2$. Therefore,
$$E(C_{LR}) \geq \frac{1}{16d} \cdot \frac{\sum_{(i,j) \in E} \big((w_i - \bar{w}) - (w_j - \bar{w})\big)^2}{\sum_{i \in V} (w_i - \bar{w})^2}.$$
Define $x$ to be a vector of length $n$ with elements $x_i = w_i - \bar{w}$. Substituting,
$$E(C_{LR}) \geq \frac{1}{16d} \cdot \frac{\sum_{(i,j) \in E} (x_i - x_j)^2}{\sum_{i \in V} x_i^2}.$$
Since $\sum_i x_i = 0$, $x$ is orthogonal to the first eigenvector $v_1 = (1, 1, \ldots, 1)^T$ of the Laplacian matrix. From Fact 2 in Section 2 it follows that $E(C_{LR}) \geq \frac{\lambda_2}{16d}$. ∎
Remark 1. Note that in Algorithm LR each agent
uses the global maximum degree d. The algorithm
can be easily modified such that each agent uses only
its own degree and the degrees of its neighbors. The
convergence bound of Theorem 1 still holds. Also,
the resultant algorithm is completely local. Note in
particular that determining if edge $(i, j)$ should be re-
moved from M could be done by communicating only
with neighbors.
Remark 2. The convergence bound in Theorem 1
is asymptotically optimal since there exists a graph
with an assignment of weights, such that no algorithm
which moves weights across a set of matching edges
can have a convergence factor of $\omega(\lambda_2/2d)$. This is seen by considering the line graph of size $n$ with the real weights $1, 2, \ldots, n$ on its vertices. It is easy to see that any algorithm on our model cannot achieve a convergence factor beyond $O(1/n^2)$ for this input. Since $\lambda_2/2d = \Theta(1/n^2)$, the claim follows.
4 Algorithm for IWM (Discrete Weights)
In this section we extend our result from Section 3 to
the case when the weights are discrete. Note that in
this case, the weights at the endpoints of an edge can-
not be equalized beyond a precision of one unit. We
modify Algorithm LR and obtain Algorithm Discrete-
Local-Random (DLR).
In Algorithm DLR a matching $M$ is chosen locally in the same manner as in Algorithm LR. The only difference is in weight equalization for the edges in $M$. Assume that an edge $(i, j)$ has been chosen in $M$ and, without loss of generality, $w_i > w_j$. When $(w_i + w_j)$ is even, weight equalization as in Algorithm LR suffices. But when $(w_i + w_j)$ is odd, a total of $w_i - (w_i + w_j + 1)/2$ load units are moved from agent $i$ to agent $j$. Note that the new weights on agents $i$ and $j$ after this transfer are $(w_i + w_j + 1)/2$ and $(w_i + w_j - 1)/2$, respectively.
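The only algorithmic change relative to Algorithm LR is thus the equalization step, which must keep the weights integral. A hedged Python version of that step is shown below (an illustration of ours, not the authors' code; here the heavier endpoint keeps the extra unit when the total is odd, matching the description above).

```python
def dlr_equalize(weights, i, j):
    """Discrete equalization across a matched edge (i, j) with integer weights.
    If w_i + w_j is even the weights are equalized exactly; if it is odd, the
    heavier agent ends with (w_i + w_j + 1) // 2 and the lighter with
    (w_i + w_j - 1) // 2, so the two differ by exactly one unit."""
    hi, lo = (i, j) if weights[i] >= weights[j] else (j, i)
    total = weights[hi] + weights[lo]
    weights[hi] = (total + 1) // 2   # equals total // 2 when total is even
    weights[lo] = total // 2

# Hypothetical example: weights 7 and 2 become 5 and 4.
w = {0: 7, 1: 2}
dlr_equalize(w, 0, 1)
print(w)   # {0: 5, 1: 4}
```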
Theorem 2 For any connected graph $G$ and weight vector $\vec{w}$, the expected value of the convergence factor $C_D$ of Algorithm DLR is at least $\frac{\lambda_2}{32d}$ as long as $\Phi \geq 2dn/\lambda_2$.
Proof: Let $\Delta_{i,j}(\mathrm{even})$ denote the drop in potential due to weight transfer across an edge $(i, j)$ where $(w_i + w_j)$ is even. Let $\Delta_{i,j}(\mathrm{odd})$ denote the drop in potential due to weight transfer across an edge $(i, j)$ where $(w_i + w_j)$ is odd. Note that $\Delta_{i,j}(\mathrm{even}) = (w_i - w_j)^2/2$ and $\Delta_{i,j}(\mathrm{odd}) = ((w_i - w_j)^2 - 1)/2$. Note,
$$E(C_D) = \frac{\sum_{(i,j)} \Pr((i,j) \in M) \cdot \Delta_{i,j}(\mathrm{even}) + \sum_{(i,j)} \Pr((i,j) \in M) \cdot \Delta_{i,j}(\mathrm{odd})}{\Phi}.$$
By Lemma 1,
$$E(C_D) \geq \frac{1}{16d}\left(\frac{\sum_{(i,j) \in E} (w_i - w_j)^2 - e}{\Phi}\right) = \frac{1}{16d}\left(\frac{\sum_{(i,j) \in E} (w_i - w_j)^2}{\sum_{i \in V} (w_i - \bar{w})^2} - \frac{e}{\Phi}\right),$$
where $e \leq nd$ is the number of edges in $G$. As in Theorem 1, the first term above is at least $\lambda_2$. Therefore, $E(C_D) \geq \frac{\lambda_2}{16d} - \frac{n}{16\Phi}$. Clearly, when $\Phi = \Omega(2dn/\lambda_2)$, we have $E(C_D) = \Omega(\lambda_2/32d)$. ∎
Theorem 3 Consider an agent graph $G$ with $n$ vertices and degree $d$, and weight vector $\vec{w}$, such that the potential $\Phi = O(dn/\lambda_2)$. The expected value of $\Delta\Phi_D$ is $\Omega(1/d)$, where $\Delta\Phi_D$ is the random variable denoting the decrease in potential by using Algorithm DLR.
Proof: We proceed as in the proof of Theorem 2. As in the proof there, using the values of $\Delta_{i,j}(\mathrm{even})$ and $\Delta_{i,j}(\mathrm{odd})$, we have
$$E(\Delta\Phi_D) = \sum_{(i,j)} \Pr((i,j) \in M) \cdot \frac{(w_i - w_j)^2}{2} \;+\; \sum_{(i,j)} \Pr((i,j) \in M) \cdot \frac{(w_i - w_j)^2 - 1}{2},$$
where the first sum is over edges with $(w_i + w_j)$ even and the second over edges with $(w_i + w_j)$ odd. As long as there exists an edge $(i, j)$ such that $|w_i - w_j| \geq 2$, we get that $\Delta\Phi_D$ is $\Omega(1/d)$ using Lemma 1. Note that, if for every edge $(i, j) \in E$, $|w_i - w_j| \leq 1$, then the load-balanced state has already been reached. ∎
Remark 1. The claim in Remark 2 in Section 3 con-
cerning the lower bound on the convergence factor
applies here as well. That is, Algorithm DLR has an asymptotically optimal convergence factor when $\Phi = \Omega(2dn/\lambda_2)$.
Remark 2. When $\Phi$ is $O(2dn/\lambda_2)$, we can show that no algorithm can guarantee a convergence factor of $\Omega(\lambda_2/2d)$ for every input graph $G$ and weight distribution $\vec{w}$. For proof, consider a line graph of size $n$ with the weights $1, 2, \ldots, n-1, n+1$ on the agents. We omit further details of the argument.
5 Applications
As mentioned earlier, our algorithm for the IWM
problem can be used to obtain efficient algorithms for
a variety of problems like dynamic load balancing,
problem re-partitioning and progressive task migra-
tion. In this section we describe the application to
dynamic load balancing; descriptions of our other applications are omitted here. Formally, the dynamic load balancing problem is as follows: given a processor graph $G$ and a discrete load distribution $\vec{w}$ on the processors, move load units along matching links in $G$ in each time step so as to balance the load on each processor. The balanced state reached here is that in which, for each edge $(i, j)$, the load difference $|w_i - w_j|$ is either 0 or 1. We can solve the dynamic
load balancing problem by repeatedly invoking Al-
gorithm DLR and moving incrementally towards the
load-balanced state. This yields,
Theorem 4 The dynamic load balancing problem for
discrete weights can be solved by invoking Algorithm
DLR $O((d/\lambda_2)(\log \Phi_0 + dn))$ times, with high probability.
Proof: Let $\Phi_0$ be the initial potential and let $\Phi_k$ be the random variable denoting the potential of processor graph $G$ after the $k$th invocation of Algorithm DLR. Henceforth, we consider only the case when $\Phi_0$ is $\Omega(dn/\lambda_2)$; the case when $\Phi_0 = O(dn/\lambda_2)$ can be handled similarly. By Theorem 2, as long as $\Phi_k = \Omega(dn/\lambda_2)$, the $(k+1)$th invocation of Algorithm DLR decreases the potential by a factor of $\gamma = \Omega(\lambda_2/16d)$ in the expected case. Therefore Algorithm DLR is invoked $O((d/\lambda_2)\log \Phi_0)$ times in the expected case before the potential becomes $O(dn/\lambda_2)$.⁴ After this happens, by Theorem 3, each invocation of DLR decreases the potential by an additive $1/d$ term in the expected case. Therefore Algorithm DLR is called $O(\frac{d}{\lambda_2}(\log \Phi_0 + dn))$ times in total in the expected case before the load-balanced state is reached.

It remains for us to show that with high probability, Algorithm DLR is invoked $O((d/\lambda_2)(\log \Phi_0 + dn))$ times. In what follows, we show that it is invoked $O((d/\lambda_2)\log \Phi_0)$ times with high probability before the potential drops below $O(dn/\lambda_2)$; a similar high concentration result can be proved on the number of invocations of DLR when the potential becomes at most $O(dn/\lambda_2)$.

Let $m_1, m_2, \ldots, m_r$ be the random variables representing the matchings identified by the algorithm respectively in invocations $1, \ldots, r$. Note that these random variables are independent. Let $E_{m_k}(\Phi_k)$ be the expected value of $\Phi_k$ computed over all possible assignments to $m_k$. We claim,

1. $E_{m_k}(\Phi_k) \leq (1 - \gamma)\,\Phi_{k-1}$.

2. $E(\Phi_k) \leq (1 - \gamma)^k\,\Phi_0$.

The first claim is easy to see. To see the second claim, note that by definition,
$$E(\Phi_k) = E_{m_1, m_2, \ldots, m_k}(\Phi_k) = E_{m_1}(E_{m_2}(\cdots E_{m_{k-1}}(E_{m_k}(\Phi_k)) \cdots)) \leq (1 - \gamma)\, E_{m_1}(E_{m_2}(\cdots E_{m_{k-1}}(\Phi_{k-1}) \cdots)) \quad \text{(by Claim 1)}.$$
By repeating this $k - 1$ more times, $E(\Phi_k) \leq (1 - \gamma)^k \Phi_0$, proving the claim.
⁴The number of times Algorithm DLR is invoked in this phase is at most $\frac{\log \Phi_0}{\log(1/(1-\gamma))} + 1$.
Given the two claims above, we now show that with high probability, $\Phi_{2k} \leq (1 - \gamma)^k \Phi_0$. Note, by Markov's Inequality, $\Pr(\Phi_{2k} > B) \leq \frac{E(\Phi_{2k})}{B}$. Choose $B = (1 - \gamma)^k \Phi_0$; then $\Pr(\Phi_{2k} > (1 - \gamma)^k \Phi_0) \leq (1 - \gamma)^k$. Clearly, $\Pr(\Phi_{2k} > (1 - \gamma)^k \Phi_0)$ is exponentially small in $k$ (recall that $\gamma < 1$). That proves the claim and also the theorem. ∎
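For concreteness, the repeated invocation analyzed in Theorem 4 can be driven by a loop like the Python sketch below (ours, not the authors' code; it assumes a routine `dlr_step(weights, edges, d)` performing one invocation of Algorithm DLR, which can be assembled from the matching and equalization steps of Sections 3 and 4). It stops exactly at the balanced state used in the theorem, where every edge has load difference at most one.

```python
def is_balanced(weights, edges):
    """Balanced state of Theorem 4: |w_i - w_j| <= 1 across every edge."""
    return all(abs(weights[u] - weights[v]) <= 1 for u, v in edges)

def balance(weights, edges, d, dlr_step, max_rounds=1000000):
    """Repeatedly invoke Algorithm DLR (supplied as dlr_step) until balanced.
    Returns the number of invocations used."""
    rounds = 0
    while not is_balanced(weights, edges) and rounds < max_rounds:
        dlr_step(weights, edges, d)   # one random matching plus discrete equalization
        rounds += 1
    return rounds
```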
6 Experimental Observations
An initial assessment of repeated invocations of Al-
gorithm LR and Algorithm DLR for the problem of
dynamic load balancing was obtained through simula-
tion and experimentation on processor graphs of dif-
ferent sizes and connectivities. The load on each pro-
cessor was chosen uniformly and randomly from the
interval $(0, a)$ for various values of $a$; in what follows, such a load distribution is denoted by Random(0, a). We discuss two important issues and present a small sample of our experimental data.
1. Predicted versus Observed Convergence
Factor: Recall that our main results provide only
a lower bound on the convergence factors of our algo-
rithms. The observed average convergence factor of
Algorithm LR was consistently better than the pre-
dicted bound across all load distributions and graphs
that were considered. This is not surprising since the
lower bound on the convergence factor can be attained
only for a very restricted class of weight assignments
to the processors.
Define R to be the ratio of the experimentally ob-
served convergence factor of Algorithm LR to its theo-
retical lower bound of $\lambda_2/16d$. Figure 1 shows the plot
of R (averaged over 20 runs of LR) versus the number
of invocations of LR. Observe that the average of the
ratio R decreases with the increasing edge-density of
the graph. That is, the observed convergence factor
is significantly more than the theoretical lower bound
for sparse graphs and the two factors are comparable
in the case of dense graphs.
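A sketch of how the ratio R can be computed from a run is given below (ours, not the authors' measurement code; the sample numbers are hypothetical): it records the potential before and after an invocation, forms the observed convergence factor, and divides by the theoretical lower bound $\lambda_2/16d$.

```python
def convergence_ratio(phi_before, phi_after, lambda2, d):
    """R = observed convergence factor divided by the lower bound lambda_2 / 16d."""
    observed = (phi_before - phi_after) / phi_before
    return observed / (lambda2 / (16.0 * d))

# Hypothetical numbers: the potential drops from 1000.0 to 900.0 in one invocation
# on a graph with lambda_2 = 2 and d = 6 (the 64-node hypercube of Figure 3).
print(convergence_ratio(1000.0, 900.0, 2.0, 6))   # 4.8
```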
2. Real versus Discrete Loads: The simplicity of
our analysis is based on the fact that the case when
the weights are discrete behaves much the same way
as the case when the weights are real as long as the potential is large. More precisely, we showed in Section 4 that as long as the potential is larger than $2dn/\lambda_2$, the convergence factor of Algorithm DLR is at least
one-half that of the theoretical lower bound on the
convergence factor for Algorithm LR.
The following are the results of our experiments
to study how closely the convergence factors of the
Algorithm DLR and Algorithm LR behave. Figures
2 and 3 show the decrease in potential (averaged over
20 runs) for 80 invocations of Algorithm LR and Algorithm DLR on two graphs of 64 nodes each. It is easy
to see that the convergence factor of Algorithm LR
becomes twice as much as that for Algorithm DLR
only at a potential much smaller than the theoreti-
cal bound of $2dn/\lambda_2$. In all our experiments, we ob-
served that the case when the weights are discrete
behaves much the same way as the case when the
weights are real, even when the potential is considerably smaller than the value depicted in our analysis, namely, $2dn/\lambda_2$.
A better evaluation of our algorithms can be ob-
tained from analyzing real applications on parallel and
distributed platforms. We are pursuing such a direc-
tion for the problem of dynamic adaptive mesh parti-
tioning on distributed memory machines.
7 Discussion
Recently we have considered the related model of load
movement in which at most one unit of load can be
moved along an edge; the load is restricted to be
moved along a matching set of edges as before. We
call this the unit-capacity model. We have modified
our Algorithm DLR from Section 4 to work in the
unit-capacity model and extended our analysis from
there. An important result we derive is the following.
Theorem 5 Algorithm DLR for the IWM problem
can be modified to work on the unit-capacity model
for synchronous networks under possible edge failures
such that the dynamic load balancing problem can be solved in $O(\cdot)$ invocations of this algorithm when the initial potential $\Phi_0 = \Omega((\cdot)^2)$.
As a result we derive an algorithm which converges
to the load balanced state provably faster (when the
initial potential is sufficiently large) than the fastest
algorithm previously known [AA+93] for this prob-
lem. In fact our algorithm works on a weaker model
than the one in [AA+93]. However, our algorithm is
randomized while the algorithm in [AA+93] is deter-
ministic. We defer further details to the full paper.
8 Acknowledgements
Thanks to Ravi Boppana, Stan Eisenstat, Laszlo Lo-
vasz, Eric Mjolsness and Martin Schultz for their crit-
ical feedback and encouragement during this work.
Bibliography
[AA+93] W. Aiello, B. Awerbuch, B. Maggs, and S. Rao. Approximate Load Balancing on Dynamic and Asynchronous Networks. In Proc. of 25th ACM Symp on Theory of Computing, 632-641, May 1993.

[AB92] Y. Aumann and M. Ben-Or. Computing with Faulty Arrays. In Proc. of 24th ACM Symp on Theory of Computing, 162-169, May 1992.

[AHS91] J. Aspnes, M. Herlihy and N. Shavit. Counting Networks and Multiprocessor Coordination. In Proc. of 23rd ACM Symp on Theory of Computing, 348-358, May 1991.

[B87] B. Bollobas. Random Graphs. Academic Press, New York, 1987.

[BB87] M. J. Berger and S. H. Bokhari. A Partitioning Strategy for Nonuniform Problems on Multiprocessors. IEEE Trans. on Computers, Vol. C-36, No. 5, 570-580, 1987.

[CA87] G. Cybenko and T. G. Allen. Parallel Algorithms for Classification and Clustering. In Proc. SPIE Conference on Advanced Architectures and Algorithms for Signal Processing, San Diego, CA, 1987.

[C89] G. Cybenko. Dynamic Load Balancing for Distributed Memory Multiprocessors. Journal of Parallel and Distributed Computing, Vol. 7, No. 2, 279-301, 1989.

[F93] R. Feldmann. Game Tree Search on Massively Parallel Systems. PhD Thesis, Dept. of Mathematics and Computer Science, University of Paderborn, August 1993.

[FG91] M. Factor and D. Gelernter. Software Backplanes, Realtime Data Fusion and the Process Trellis. Research Report YALEU/DCS/TR-852, Yale Computer Science Department, March 1991.

[GH89] B. Goldberg and P. Hudak. Implementing Functional Programs on a Hypercube Multiprocessor. In Proc. of the 4th Conference on Hypercubes, Concurrent Computers and Applications, Vol. 1, 489-503, 1989.

[HCT89] J. Hong, M. Chen and X. Tan. Dynamic Cyclic Load Balancing on Hypercubes. In Proc. of the 4th Conference on Hypercubes, Concurrent Computers and Applications, Vol. 1, 595-598, 1989.

[HLS92] M. Herlihy, B. Lim, and N. Shavit. Low contention load balancing on large-scale multiprocessors. In Proc. of 4th ACM Symp on Parallel Algorithms and Architectures, 219-227, 1992.

[HT93] A. Heirich and S. Taylor. A Parabolic Theory of Load Balance. Research Report Caltech-CS-TR-93-25, Caltech Scalable Concurrent Computation Lab, March 1993.

[K88] L. V. Kale. Comparing the Performance of Two Dynamic Load Distribution Methods. In Proc. of International Conference on Parallel Processing, Vol. 1, August 1988.

[KZ88] R. Karp and Y. Zhang. A randomized parallel branch-and-bound procedure. In Proc. of 20th ACM Symp on Theory of Computing, 290-300, 1988.

[LK87] F. C. H. Lin and R. M. Keller. The Gradient Model Load Balancing Method. IEEE Transactions on Software Engineering, Vol. 13, No. 1, 32-38, 1987.

[LM93] R. Lueling and B. Monien. A Dynamic Distributed Load Balancing Algorithm with Provable Good Performance. In Proc. of 5th ACM Symp on Parallel Algorithms and Architectures, 164-172, 1993.

[LMR91] R. Lueling, B. Monien and F. Ramme. Load Balancing in Large Networks: A Comparative Study. In Proc. of IEEE Symp on Parallel and Distributed Computing, Dallas, 1991.

[LN+89] T. Leighton, M. Newman, A. Ranade and E. Schwabe. Dynamic tree embedding on butterflies and hypercubes. In Proc. of 1st ACM Symp on Parallel Algorithms and Architectures, 224-234, 1989.

[M89] M. Mihail. Conductance and Convergence of Markov Chains - A Combinatorial Treatment of Expanders. In Proc. of 30th IEEE Symp on Foundations of Computer Science, 526-531, October 1989.

[MP92] B. Mohar and S. Poljak. Eigenvalues in Combinatorial Optimization. Research Report 92752, IMA, Minneapolis, 1992.

[N92] D. Nicol. Communication Efficient Global Load Balancing. In Proc. of Scalable High Performance Computing Conference, 292-299, Williamsburg, VA, April 1992.

[NX+85] L. M. Ni, C. W. Xu and T. B. Gendreau. Drafting Algorithm - A Dynamic Process Migration Protocol for Distributed Systems. In Proc. of Int. Conf. on Distributed Computing Systems, 539-546, 1985.

[P89] C. G. Plaxton. Load Balancing, Selection and Sorting on the Hypercube. In Proc. of 1st ACM Symp on Parallel Algorithms and Architectures, 64-73, 1989.

[PU89] D. Peleg and E. Upfal. The token distribution problem. SIAM J. on Computing, Volume 18, 229-243, 1989.

[R89] M. O. Rabin. Efficient dispersal of information for security, load balancing and fault tolerance. Journal of the ACM, Vol. 36, No. 3, 335-348, 1989.

[R91] A. Ranade. Optimal speedup for backtrack search on a butterfly network. In Proc. of 3rd ACM Symp on Parallel Algorithms and Architectures, 40-49, 1991.

[RSU91] L. Rudolph, M. Slivkin-Allalouf, and E. Upfal. A simple load balancing scheme for task allocation in parallel machines. In Proc. of 3rd ACM Symp on Parallel Algorithms and Architectures, 237-243, 1991.

[W91] R. D. Williams. Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurrency: Practice and Experience, Vol. 3, No. 5, 457-481, 1991.
Figure 2: Plot of the natural logarithm of the poten-
tial versus the number of invocations of Algorithms
LR and DLR on a 64-node Random Graph with edge
probability 0.5. The values of $\lambda_2$ and $d$ were deter-
mined to be 38 and 56 respectively. The predicted
potential was calculated using the theoretical lower
bound on the convergence factor for the case when
the weights are real. Initial load distribution was
Random(0, 30).
Figure 1: Plot of ratio R (defined in Section 6) versus
the number of invocations of Algorithm LR on a Hy-
percube (*), a Random Graph with edge probability
0.5 (+), and a Clique (x), each with 64 nodes. The
value of R for the three graphs averaged over the invo-
cations were 4.16, 2.05, and 1.46 respectively. Initial
load distribution was Random(0, 100).
Figure 3: Plot of the natural logarithm of the poten-
tial versus the number of invocations of Algorithms
LR and DLR on a Hypercube with 64 nodes. Here
$\lambda_2 = 2$ and $d = 6$. Again the predicted potential was
calculated using the theoretical lower bound on the
convergence factor for the case when the weights are
real. Initial load distribution was Random(0, 30).