load balancing

Post on 03-Dec-2015

CS 584

Load Balancing

Goal: all processors working all the time
- Efficiency of 1
- Distribute the load (work) to meet the goal

Two types of load balancing
- Static
- Dynamic

Load Balancing

The load balancing problem can be reduced to the bin-packing problem
- NP-complete

For simple cases, we can do well, but …
- Heterogeneity
- Different types of resources
  - Processor
  - Network, etc.

Evaluation of load balancing

Efficiency
- Are the processors always working?
- How much processing overhead is associated with the load balance algorithm?

Communication
- Does load balance introduce or affect the communication pattern?
- How much communication overhead is associated with the load balance algorithm?
- How many edges are cut in the communication graph?
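The two evaluation questions above can be turned into numbers. The following is a minimal sketch, assuming the communication graph is given as an edge list and that efficiency is measured as mean load over maximum load (an assumption that ignores balancing overhead); the function names `edge_cut` and `efficiency` are illustrative only.

```python
def edge_cut(edges, part):
    """Count edges whose endpoints are assigned to different processors."""
    return sum(1 for u, v in edges if part[u] != part[v])


def efficiency(loads):
    """Mean load divided by maximum load; 1.0 means perfectly balanced."""
    return sum(loads) / (len(loads) * max(loads))


# A 4-node ring split into two halves.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
part = {0: 0, 1: 0, 2: 1, 3: 1}
cut = edge_cut(edges, part)
balance = efficiency([2, 2])
```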

Partitioning Techniques

Regular grids (-: Easy :-)
- striping
- blocking
- use processing power to divide the load more fairly

Generalized graphs
- Levelization
- Scattered Decomposition
- Recursive Bisection
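As an illustration of striping a regular grid in proportion to processing power, here is a small sketch. It assumes each processor has a known relative speed; `weighted_stripes` and its parameters are hypothetical names, not part of the slides.

```python
def weighted_stripes(n_rows, speeds):
    """Split rows 0..n_rows-1 into contiguous stripes proportional to speed.

    Returns one (start, end) half-open row range per processor.
    """
    total = sum(speeds)
    bounds, start, acc = [], 0, 0
    for s in speeds:
        acc += s
        end = round(n_rows * acc / total)  # cumulative share of the rows
        bounds.append((start, end))
        start = end
    return bounds


# Three processors, the third twice as fast as the others.
stripes = weighted_stripes(100, [1, 1, 2])
```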

Levelization

Begin with a boundary
- Number these nodes 1
All nodes connected to a level 1 node are labeled 2, etc.
Partitioning is performed
- determine the number of nodes per processor
- count off the nodes of a level until exhausted
- proceed to the next level
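The levelization steps can be sketched with a breadth-first search. This is only an illustration, assuming a connected graph stored as an adjacency dictionary and an even count-off of ceil(n/p) nodes per processor; the names `levelize` and `partition_by_levels` are made up for this example.

```python
from collections import deque


def levelize(adj, boundary):
    """Label boundary nodes 1, their unlabeled neighbors 2, and so on.

    Returns the nodes in level order together with the level of each node.
    """
    level = {v: 1 for v in boundary}
    order = list(boundary)
    queue = deque(boundary)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                order.append(v)
                queue.append(v)
    return order, level


def partition_by_levels(adj, boundary, p):
    """Count off nodes level by level, ceil(n/p) nodes per processor."""
    order, _ = levelize(adj, boundary)
    per_proc = -(-len(order) // p)  # ceiling division
    return {v: i // per_proc for i, v in enumerate(order)}


# A path of six nodes, with node 0 as the boundary.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
owner = partition_by_levels(path, [0], 2)
```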

Levelization

Want to ensure nearest-neighbor communication.
Let p be the number of processors and n the number of nodes.
Let r_i be the sum of the number of nodes in contiguous levels i and i + 1.
Let r = max{r_1, r_2, …, r_n}.

Nearest-neighbor communication is assured if n/p > r.
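A quick way to check the condition for a given levelization is sketched below. It assumes the level sizes are already known and that there are at least two levels; `nearest_neighbor_assured` is a hypothetical helper name.

```python
def nearest_neighbor_assured(level_sizes, p):
    """Return True if n/p > r, where r is the largest number of nodes
    found in any two contiguous levels."""
    n = sum(level_sizes)
    r = max(a + b for a, b in zip(level_sizes, level_sizes[1:]))
    return n / p > r


# Six levels of two nodes each: r = 4 and n = 12.
ok_for_two = nearest_neighbor_assured([2, 2, 2, 2, 2, 2], 2)
ok_for_four = nearest_neighbor_assured([2, 2, 2, 2, 2, 2], 4)
```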

Scattered Decomposition

Used for highly irregular grids.
Partition the load into a large number r of rectangular clusters such that r >> p.
Each processor is given a disjoint set of r/p clusters.
Communication overhead can be a problem for highly irregular problems.
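One possible realization of scattered decomposition is sketched here. The slides do not say how clusters are dealt out, so the cyclic mapping (cluster k goes to processor k mod p) is an assumption, as are the 2D grid of nx by ny clusters and the name `scattered_decomposition`.

```python
def scattered_decomposition(points, nx, ny, p, bounds):
    """Assign each 2D point to a processor.

    The bounding box is cut into nx * ny rectangular clusters (r = nx * ny,
    chosen much larger than p) and cluster k is given to processor k % p.
    """
    xmin, ymin, xmax, ymax = bounds
    owner = []
    for x, y in points:
        cx = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
        cy = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
        cluster = cy * nx + cx
        owner.append(cluster % p)
    return owner


# Four clusters in a row, dealt out to two processors.
pts = [(0.5, 0.5), (1.5, 0.5), (2.5, 0.5), (3.5, 0.5)]
owners = scattered_decomposition(pts, 4, 1, 2, (0, 0, 4, 1))
```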

Recursive Bisection

Recursively divide the domain into two pieces at each step.
3 methods
- Recursive Coordinate Bisection (RCB)
- Recursive Graph Bisection (RGB)
- Recursive Spectral Bisection (RSB)

Recursive Coordinate Bisection

Divide the domain based on the physical coordinates of the nodes.
Pick a dimension and divide in half.

RCB uses no connectivity information
- lots of edges crossing boundaries
- partitions may be disconnected

Some new research based on graph separators overcomes some problems.
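A compact sketch of recursive coordinate bisection follows. The slides only say to pick a dimension; choosing the dimension with the largest extent is an assumption made here, and `rcb` is an illustrative name.

```python
def rcb(points, ids, depth):
    """Recursive coordinate bisection into 2**depth parts.

    points maps a node id to its coordinate tuple; ids is the list of node
    ids to split. Returns a list of id lists.
    """
    if depth == 0:
        return [ids]
    dims = range(len(points[ids[0]]))
    # Assumption: split along the dimension with the largest extent.
    dim = max(dims, key=lambda d: max(points[i][d] for i in ids)
              - min(points[i][d] for i in ids))
    ordered = sorted(ids, key=lambda i: points[i][dim])
    mid = len(ordered) // 2
    return (rcb(points, ordered[:mid], depth - 1)
            + rcb(points, ordered[mid:], depth - 1))


# A 4 x 2 grid of nodes, split once along x.
grid = [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]
halves = rcb(grid, list(range(8)), 1)
```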

Inertial Bisection

Often, coordinate bisection is susceptible to the orientation of the mesh.
Solution: find the principal axis of the communication graph and bisect perpendicular to it.
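For 2D node coordinates, the principal axis can be found in closed form from the second moments of the point set. The sketch below assumes the axis is computed from node coordinates and that the split is made at the median projection; `inertial_bisection` is a hypothetical name.

```python
import math


def inertial_bisection(points):
    """Split 2D points in half along their principal axis.

    Returns two lists of point indices.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Angle of the direction of largest variance.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ax, ay = math.cos(theta), math.sin(theta)
    order = sorted(range(n), key=lambda i: (points[i][0] - mx) * ax
                   + (points[i][1] - my) * ay)
    return order[: n // 2], order[n // 2:]


# Points along a diagonal: the principal axis is the diagonal itself.
diag = [(i, i) for i in range(6)]
left, right = inertial_bisection(diag)
```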

Graph Theory Based Algorithms

Geometric algorithms are generally low quality
- they don’t take connectivity into account

Graph theory algorithms apply what we know about generalized graphs to the partitioning problem.
Hopefully, they reduce the cut size.

Greedy Bisection

Start with a vertex of the smallest degree
- least number of edges
Mark all its neighbors.
Mark all its neighbors' neighbors, etc.
The first n/p marked vertices form one subdomain.
Apply the algorithm on the remaining vertices.
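The greedy procedure can be sketched as repeated breadth-first marking. The tie-breaking by vertex id, the restart from a new seed when marking runs dry, and giving the last subdomain all leftover vertices are assumptions added to make the example complete; `greedy_partition` is an illustrative name and vertices are assumed to be integers.

```python
from collections import deque


def greedy_partition(adj, p):
    """Grow p subdomains of about n/p vertices each by marking neighbors."""
    remaining = set(adj)
    size = len(adj) // p
    parts = []
    for k in range(p):
        target = size if k < p - 1 else len(remaining)
        part = []
        while len(part) < target:
            # Seed: smallest degree among the vertices not yet assigned.
            seed = min(remaining, key=lambda v: (
                sum(1 for u in adj[v] if u in remaining), v))
            remaining.discard(seed)
            part.append(seed)
            queue = deque([seed])
            while queue and len(part) < target:
                u = queue.popleft()
                for v in adj[u]:
                    if v in remaining and len(part) < target:
                        remaining.discard(v)
                        part.append(v)
                        queue.append(v)
        parts.append(part)
    return parts


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
subdomains = greedy_partition(path, 2)
```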

Recursive Graph Bisection

Based on graph distance rather than coordinate distance.
Determine the two furthest separated nodes.
Organize and partition nodes according to their distance from the extremities.
Computationally expensive
- Can use approximation methods
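One common approximation, sketched below, finds two far-apart nodes with two breadth-first sweeps instead of computing all pairwise distances, then orders nodes by the difference of their distances to the two extremities. This is an assumed approximation method, not necessarily the one the slides have in mind; the graph is assumed connected, and `bfs_dist` and `graph_bisect` are illustrative names.

```python
from collections import deque


def bfs_dist(adj, src):
    """Graph distance (in edges) from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def graph_bisect(adj):
    """Bisect a connected graph by distance from two approximate extremities."""
    start = next(iter(adj))
    d0 = bfs_dist(adj, start)
    a = max(d0, key=d0.get)        # far from an arbitrary start
    da = bfs_dist(adj, a)
    b = max(da, key=da.get)        # far from a
    db = bfs_dist(adj, b)
    order = sorted(adj, key=lambda v: da[v] - db[v])
    half = len(order) // 2
    return order[:half], order[half:]


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
near_a, near_b = graph_bisect(path)
```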

Recursive Spectral Bisection

Uses the discrete Laplacian
- Let A be the adjacency matrix
- Let D be the diagonal matrix where D[i,i] is the degree of node i

L_G = A - D

Recursive Spectral Bisection

L_G is negative semidefinite.
Its largest eigenvalue is zero, and the corresponding eigenvector is all ones.
The magnitude of the second largest eigenvalue gives a measure of the connectivity of the graph.
Its corresponding eigenvector gives a measure of the distances between nodes.

Recursive Spectral Bisection

The eigenvector corresponding to the second largest eigenvalue is the Fiedler vector.
Calculation of the Fiedler vector is computationally intensive.
RSB yields connected partitions that are very well balanced.
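To make the spectral step concrete, here is a dependency-free sketch that approximates the Fiedler vector of L_G = A - D by shifted power iteration with the all-ones component removed, then splits at the median. Production partitioners use Lanczos-type solvers instead; the shift value, iteration count, and the names `fiedler_vector` and `spectral_bisect` are assumptions of this example.

```python
def fiedler_vector(adj, iters=500):
    """Approximate the eigenvector of A - D for its second largest eigenvalue."""
    nodes = sorted(adj)
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    # Shifting by 2*maxdeg + 1 makes every eigenvalue of (A - D + shift*I) positive.
    shift = 2 * max(len(adj[v]) for v in nodes) + 1
    x = [float(i) for i in range(n)]
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]            # project out the all-ones vector
        y = [(shift - len(adj[v])) * x[idx[v]]
             + sum(x[idx[u]] for u in adj[v]) for v in nodes]
        norm = sum(yi * yi for yi in y) ** 0.5
        x = [yi / norm for yi in y]
    return dict(zip(nodes, x))


def spectral_bisect(adj):
    """Order nodes by Fiedler value and cut the ordering in half."""
    f = fiedler_vector(adj)
    order = sorted(adj, key=f.get)
    half = len(order) // 2
    return order[:half], order[half:]


# Two triangles joined by the single edge 2-3.
barbell = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
           3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
side_a, side_b = spectral_bisect(barbell)
```

On this small example the cut separates the two triangles, which is the single-edge cut one would expect.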

Example

RCB: 529 edges cut
RGB: 618 edges cut
RSB: 299 edges cut

Global vs Local Partitioning

Global methods produce a “good” partitioning.
Local methods can then be used to improve the partitioning.

The Kernighan-Lin algorithm

Swap pairs of nodes to decrease the cut.
Will allow intermediate increases in the cut size to avoid certain local minima.
Loop
- choose the pair of nodes with the largest benefit of swapping
- logically exchange them (not for real)
- lock those nodes
until all nodes are locked
Find the sequence of swaps that yields the largest accumulated benefit.
Perform the swaps for real.
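A single pass of this loop can be sketched as follows for an unweighted graph. The representation (adjacency sets, a `side` dictionary with values 0 and 1) and the name `kl_pass` are assumptions of this example; the exhaustive pair search is quadratic per step and is kept only for clarity.

```python
def kl_pass(adj, side):
    """One Kernighan-Lin pass. Returns (new_side, cut_reduction)."""
    side = dict(side)
    # D[v] = (edges to the other side) - (edges to the own side)
    D = {v: sum(1 if side[u] != side[v] else -1 for u in adj[v]) for v in adj}
    unlocked = set(adj)
    swaps, gains = [], []
    while True:
        A = [v for v in unlocked if side[v] == 0]
        B = [v for v in unlocked if side[v] == 1]
        if not A or not B:
            break
        # Pair with the largest benefit of swapping.
        a, b = max(((a, b) for a in A for b in B),
                   key=lambda ab: D[ab[0]] + D[ab[1]] - 2 * (ab[1] in adj[ab[0]]))
        gains.append(D[a] + D[b] - 2 * (b in adj[a]))
        swaps.append((a, b))
        unlocked -= {a, b}                      # lock the pair
        # Logical exchange: update D as if a and b had traded sides.
        for v in unlocked:
            ca, cb = (a in adj[v]), (b in adj[v])
            if side[v] == 0:
                D[v] += 2 * ca - 2 * cb
            else:
                D[v] += 2 * cb - 2 * ca
    # Prefix of the swap sequence with the largest accumulated benefit.
    best_k, best_gain, total = 0, 0, 0
    for k, g in enumerate(gains, 1):
        total += g
        if total > best_gain:
            best_k, best_gain = k, total
    for a, b in swaps[:best_k]:                 # perform the swaps for real
        side[a], side[b] = 1, 0
    return side, best_gain


# Two triangles joined by edge 2-3, with nodes 2 and 3 on the wrong sides.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
start = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}
improved, gain = kl_pass(g, start)
```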

Helpful-Sets

Two steps
- Find a set of nodes in one partition and move it to the other partition to decrease the cut size
- Rebalance the load

The set of nodes moved must be helpful.
The helpfulness of a node is equal to the change in cut size if the node is moved.
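Helpfulness extends naturally from a single node to a set: count the cut edges that disappear when the set moves, minus the new cut edges it creates. The sketch below assumes an unweighted graph and that every node of the set lies on the same side; `helpfulness` is an illustrative name.

```python
def helpfulness(adj, side, nodes):
    """Reduction in cut size if all of `nodes` move to the other side."""
    s = set(nodes)
    # Edges to the other side stop being cut.
    external = sum(1 for v in s for u in adj[v] if side[u] != side[v])
    # Edges to nodes left behind on the same side become cut.
    internal = sum(1 for v in s for u in adj[v]
                   if side[u] == side[v] and u not in s)
    return external - internal


g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
side = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}
h_single = helpfulness(g, side, [3])
h_pair = helpfulness(g, side, [0, 1])
```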

Helpful-Sets

All these sets are 2-helpful

The Helpful-Sets Algorithm

Theory
- If there is a bisection and its cut size is not “too small”, then there exists a small 4-helpful set in one side or the other
- This 4-helpful set can be moved and will reduce the cut by 4
- If the imbalance is not “too large” and the cut of the unbalanced partition is not “too small”, then it is possible to rebalance without increasing the cut size by more than 2

Apply the theory iteratively until the “too small” condition is met.

Multi-level Hybrid Methods

For very large graphs, partitioning can be extremely time-consuming.
Reduce time by coarsening the graph
- shrink a large graph to a smaller one that has similar characteristics

Coarsen by
- heavy edge matching
- simple partitioning heuristics
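One coarsening level with heavy edge matching might look like the sketch below. Vertices are visited in index order here for reproducibility, whereas practical implementations usually visit them in random order; the edge-weight dictionary layout and the names `heavy_edge_matching` and `contract` are assumptions.

```python
def heavy_edge_matching(weights, n):
    """Match each unmatched vertex with its heaviest unmatched neighbor.

    weights maps (u, v) with u < v to an edge weight. Returns a dict from
    each fine vertex to its coarse vertex id.
    """
    nbrs = {v: [] for v in range(n)}
    for (u, v), w in weights.items():
        nbrs[u].append((w, v))
        nbrs[v].append((w, u))
    coarse, next_id = {}, 0
    for v in range(n):
        if v in coarse:
            continue
        candidates = [(w, u) for w, u in nbrs[v] if u not in coarse]
        coarse[v] = next_id
        if candidates:
            _, u = max(candidates)      # heaviest incident edge
            coarse[u] = next_id
        next_id += 1
    return coarse


def contract(weights, coarse):
    """Build the coarse graph, summing the weights of merged edges."""
    cw = {}
    for (u, v), w in weights.items():
        cu, cv = coarse[u], coarse[v]
        if cu != cv:
            key = (min(cu, cv), max(cu, cv))
            cw[key] = cw.get(key, 0) + w
    return cw


# Path 0-1-2-3 with heavy outer edges and a light middle edge.
w = {(0, 1): 5, (1, 2): 1, (2, 3): 5}
mapping = heavy_edge_matching(w, 4)
coarse_graph = contract(w, mapping)
```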

Multi-level Hybrid Methods

Comparisons

Cut size per method, with run time in seconds in parentheses on the following line.

                           Chaco                         Metis     Party
Graph    |V|      |E|      ML        IN        IN+KL     PMetis    all       all+HS
airfoil  4253     12289    85        94        83        85        94        83
                           (0.08)    (0.00)    (0.02)    (0.04)    (0.04)    (0.15)
crack    10240    30380    211       377       218       196       243       208
                           (0.16)    (0.01)    (0.05)    (0.14)    (0.10)    (0.44)
wave     156317   1059331  9542      9834      9660      9801      10361     9614
                           (3.64)    (0.19)    (1.61)    (3.50)    (2.84)    (11.93)

Graphs with only four reported results, listed in the order given:
lh (|V| = 1443, |E| = 20148, total edge weight 487380): 36376 (0.33), 22579 (0.06), 13643 (0.06), 9897 (0.23)
mat (|V| = 73752, |E| = 1761718): 9359 (1.80), 9555 (2.04), 8869 (3.45), 8869 (11.52)
DEBR (|V| = 1048576, |E| = 2097149): 100286 (48.99), 101674 (988.39), 172204 (16.63), 94272 (577.97)

(x.xx): run time in seconds
ML: Multilevel (spectral on coarse, KL on intermediate)
IN: Inertial
Party: 5 or 6 different methods

Dynamic Load Balancing

Load is statically partitioned initially.
Adjust the load when an imbalance is detected.
Objectives
- rebalance the load
- keep the edge cut minimized (communication)
- avoid having too much overhead

Dynamic Load Balancing

Consider adaptive algorithms.
After an interval of computation
- the mesh is adjusted according to an estimate of the discretization error
  - coarsened in some areas
  - refined in others

Mesh adjustment causes load imbalance.

Dynamic Load Balancing

After refinement, node 1 ends up with more work

Centralized DLB

Control of the load is centralized.
Two approaches
- Master-worker (task scheduling)
  - Tasks are kept in a central location
  - Workers ask for tasks
  - Requires lots of tasks with weak locality requirements; no major communication between workers
- Load monitor
  - Periodically monitor the load on the processors
  - Adjust the load to keep an optimal balance

Repartitioning

Consider: a dynamic situation is simply a sequence of static situations.
Solution: repartition the load after each one
- some partitioning algorithms are very quick

Issues
- scalability problems
- how different are the current load distribution and the new load distribution?
- data dependencies

Decentralizing DLB

Generally focused on a work pool.
Two approaches
- Hierarchy
- Fully distributed

Fully Distributed DLB

Lower overhead than centralized schemes.
No global information
- Load is locally optimized
- Propagation is slow
- Load balance may not be as good as with a centralized load balance scheme

Three steps
- Flow calculation (how much to move)
- Mesh node selection (which work to move)
- Actual mesh node migration

Flow calculation

View as a network flow problem
- Add source and sink nodes
- Connect the source to all nodes
  - edge value is the current load
- Connect the sink to all nodes
  - edge value is the mean load

processor communication graph

Flow calculation

Many network flow algorithms
- more intense than necessary
- not parallel

Use simpler, more scalable algorithms.
Random matchings
- pick random neighboring processes
- exchange some load
- eventually you may get there

Diffusion

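Each processor balances its load with all its neighbors.

How much work should I have?

$$w_p^{t+1} = w_p^t - \sum_{\{p,q\} \in E} \alpha_{pq}\,(w_p^t - w_q^t)$$

How much to send on an edge?

$$l_{pq}^{t+1} = \alpha_{pq}\,(w_p^t - w_q^t)$$

Here $w_p^t$ is the load on processor p at step t, $\alpha_{pq}$ is the diffusion coefficient of edge {p,q}, and the sum runs over the neighbors q of p in the processor communication graph.

Repeat until all load is balanced:

$$O\!\left(\frac{\log(1/\epsilon)}{1 - \lambda_2}\right) \text{ steps}$$

where $\epsilon$ is the tolerated remaining imbalance and $\lambda_2$ is the second-largest eigenvalue of the diffusion matrix.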
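The diffusion update can be tried out on a small processor graph. This sketch assumes a single uniform coefficient `alpha` on every edge (it should be at most 1/(max degree + 1) to stay stable) and synchronous updates; `diffuse` is an illustrative name.

```python
def diffuse(adj, load, alpha, steps):
    """Run synchronous diffusion steps.

    Each step applies w_p <- w_p - sum over neighbors q of alpha * (w_p - w_q).
    """
    w = dict(load)
    for _ in range(steps):
        w = {p: w[p] - sum(alpha * (w[p] - w[q]) for q in adj[p]) for p in adj}
    return w


# Ring of four processors with all load initially on processor 0.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
balanced = diffuse(ring, {0: 8.0, 1: 0.0, 2: 0.0, 3: 0.0}, alpha=0.25, steps=50)
```

The total load is conserved at every step because whatever p sends along an edge is exactly what q receives.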

Diffusion

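Convergence to load balance can be slow.
Can be improved with over-relaxation
- Monitor what is sent in each step
- Determine how much to send based on the current imbalance and how much was sent in previous steps

Diffuses load in

$$O\!\left(\frac{\log(1/\epsilon)}{\sqrt{1 - \lambda_2}}\right) \text{ steps}$$

where $\epsilon$ is the tolerated remaining imbalance and $\lambda_2$ is the second-largest eigenvalue of the diffusion matrix.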

Dimension Exchange

Rather than communicate with all neighbors each round, only communicate with one
- Comes from the dimensions of a hypercube
- Use edge coloring for general graphs

Exchange load with the neighbor along a dimension

l = (l_i + l_j)/2

Will converge in d steps on a hypercube of dimension d.
Some graphs may need a different factor to converge faster

l = l_i * a + l_j * (1 - a)
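On a hypercube, one sweep over the d dimensions with the averaging rule l = (l_i + l_j)/2 balances the load exactly. The sketch below simulates this for processors numbered 0 to 2^d - 1, where the neighbor along dimension k is found by flipping bit k; `dimension_exchange` is a hypothetical name.

```python
def dimension_exchange(load, d):
    """Average loads pairwise along each of the d hypercube dimensions."""
    load = list(load)
    for k in range(d):
        for i in range(2 ** d):
            j = i ^ (1 << k)           # neighbor across dimension k
            if i < j:                  # handle each pair once
                avg = (load[i] + load[j]) / 2
                load[i] = load[j] = avg
    return load


# 3-dimensional hypercube with all load on processor 0.
result = dimension_exchange([8, 0, 0, 0, 0, 0, 0, 0], 3)
```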

Diffusion & Dimension Exchange

Can view
- diffusion as a Jacobi method
- dimension exchange as Gauss-Seidel

Can use multi-level variants
- Divide the processor communication graph in half
- Determine the load to shift across the cut
- Recursively rebalance each half

Mesh node selection

Must identify which mesh nodes to migrate
- minimize edge cut and overhead

Very dependent on the problem.
Shape and size of the partition may play a role in accuracy
- Aspect ratio maintenance
- Move items that are further away from the center of gravity

Load Balancing Schemes (Who do I request work from?)

Asynchronous round robin
- each processor maintains its own target
- ask the target for work, then increment the target

Global round robin
- target is maintained by the master node

Random polling
- randomly select a donor
- each processor has equal probability
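The donor-selection rules can be sketched as small helpers. Skipping the requesting processor itself is an assumption added here, and the class and function names (`AsyncRoundRobin`, `GlobalRoundRobin`, `random_polling`) are illustrative only.

```python
import random


class AsyncRoundRobin:
    """Each processor keeps its own target and advances it after each request."""

    def __init__(self, me, p):
        self.me, self.p = me, p
        self.target = (me + 1) % p

    def next_target(self):
        t = self.target
        self.target = (self.target + 1) % self.p
        if self.target == self.me:          # never ask ourselves
            self.target = (self.target + 1) % self.p
        return t


class GlobalRoundRobin:
    """A single target counter held by the master node and shared by all."""

    def __init__(self, p):
        self.p, self.target = p, 0

    def next_target(self, requester):
        if self.target == requester:        # skip the requester itself
            self.target = (self.target + 1) % self.p
        t = self.target
        self.target = (self.target + 1) % self.p
        return t


def random_polling(me, p, rng=random):
    """Pick a donor uniformly at random among the other p - 1 processors."""
    t = rng.randrange(p - 1)
    return t if t < me else t + 1


rr = AsyncRoundRobin(me=1, p=4)
first_four = [rr.next_target() for _ in range(4)]
```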
