Scalable and Topology-Aware Load Balancers in Charm++
Amit Sharma, Parallel Programming Lab, UIUC


TRANSCRIPT

Page 1: Scalable and Topology-Aware Load Balancers in Charm++

Scalable and Topology-Aware Load Balancers in Charm++

Amit Sharma, Parallel Programming Lab, UIUC

Page 2: Outline

Outline
- Dynamic Load-Balancing framework in Charm++
- Load Balancing on Large Machines
  - Scalable Load Balancer
  - Topology-Aware Load Balancers

Page 3: Dynamic Load-Balancing Framework in Charm++

Dynamic Load-Balancing Framework in Charm++

The load-balancing task in Charm++: given a collection of migratable objects and a set of processors connected in a certain topology, find a mapping of objects to processors such that:
- each processor has almost the same amount of computation, and
- communication between processors is minimized.

The mapping of chares to processors is dynamic.

Page 4: Load-Balancing Approaches

Load-Balancing Approaches

Two major approaches:
- No predictability of load patterns: fully dynamic. Early work on state-space search, branch-and-bound, etc.; seed load balancers.
- With certain predictability (CSE, molecular dynamics simulation): measurement-based load balancing strategy.

Page 5: Principle of Persistence

Principle of Persistence

Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time, in spite of dynamic behavior:
- abrupt and large, but infrequent changes (e.g., AMR);
- slow and small changes (e.g., particle migration).

This is the parallel analog of the principle of locality: a heuristic that holds for most CSE applications.

Page 6: Measurement-Based Load Balancing

Measurement-Based Load Balancing

Based on the principle of persistence. Runtime instrumentation (the LB database) records communication volume and computation time; measurement-based load balancers use this database periodically to make new decisions (a minimal usage sketch follows the list below). Many alternative strategies can use the database:
- centralized vs. distributed;
- greedy improvements vs. complete reassignments;
- taking communication into account;
- taking dependencies into account (more complex);
- topology-aware.
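A minimal sketch of a chare array element that plugs into this framework. A matching .ci declaration (e.g., "array [1D] Worker") is assumed, and the class and member names are illustrative; usesAtSync, AtSync(), ResumeFromSync(), and pup() are the standard Charm++ hooks:

    // Sketch: a migratable chare array element using measurement-based LB.
    class Worker : public CBase_Worker {
      double myData;                  // state that must migrate with the object
    public:
      Worker() {
        usesAtSync = true;            // opt into measurement-based load balancing
      }
      Worker(CkMigrateMessage *m) : CBase_Worker(m) {}

      void doStep() {
        // ... compute and communicate; the runtime instruments this step,
        // recording computation time and communication volume (LB database)
        AtSync();                     // periodically hand control to the LB
      }

      void ResumeFromSync() {         // invoked once migrations complete
        doStep();                     // continue with the next step
      }

      void pup(PUP::er &p) {          // serialize state so the object can move
        CBase_Worker::pup(p);
        p | myData;
      }
    };

The concrete strategy is then chosen at job launch, e.g. ./charmrun ./app +p64 +balancer GreedyCommLB.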

Page 7: Load Balancer Strategies

Load Balancer Strategies

Centralized:
- object load data are sent to processor 0;
- integrated into a complete object graph;
- migration decisions are broadcast from processor 0;
- global barrier.

Distributed:
- load balancing among neighboring processors;
- builds a partial object graph;
- migration decisions are sent to the neighbors;
- no global barrier.

Page 8: Load Balancing on Very Large Machines – New Challenges

Load Balancing on Very Large Machines – New Challenges

Existing load balancing strategies don't scale on extremely large machines; consider an application with 1M objects on 64K processors. Limiting factors and issues:
- the decision-making algorithm;
- the difficulty of reaching well-informed load balancing decisions;
- resource limitations.

Page 9: Limitations of Centralized Strategies

Limitations of Centralized Strategies

Effective on small numbers of processors, where good load balance is easy to achieve. Limitations (inherently not scalable):
- the central node becomes a memory/communication bottleneck;
- decision-making algorithms tend to be very slow.

We demonstrate these limitations using the simulator we developed.

Page 10: Memory Overhead (simulation results with lb_test)

Memory Overhead (simulation results with lb_test)

[Chart: LB memory usage on the central node (MB) vs. number of objects (128K, 256K, 512K, 1M), for 32K and 64K processors.]

The lb_test benchmark is a parameterized program that creates a specified number of communicating objects in a 2D mesh. Run on 64 processors of Lemieux.

Page 11: Load Balancing Execution Time

Load Balancing Execution Time

[Chart: execution time (in seconds) vs. number of objects (128K, 256K, 512K, 1M) for GreedyLB, GreedyCommLB, and RefineLB.]

Execution time of load balancing algorithms on a 64K-processor simulation.

Page 12: Why Hierarchical LB?

Why Hierarchical LB?

Centralized load balancer:
- communication bottleneck on processor 0;
- memory constraint.

Fully distributed load balancer:
- neighborhood balancing, without global load information.

Hierarchical distributed load balancer:
- divide the processors into groups;
- apply different strategies at each level;
- scalable to a large number of processors.

Page 13: A Hybrid Load Balancing Strategy

A Hybrid Load Balancing Strategy

Processors are divided into independent sets of groups, and the groups are organized in hierarchies (decentralized). Each group has a leader (its central node) which performs centralized load balancing. A particular hybrid strategy that works well is described in Gengbin Zheng, PhD Thesis, 2005.

Page 14: Hierarchical Tree (an example)

Hierarchical Tree (an example)

[Figure: a 64K-processor hierarchical tree. Level 0: the 64K processors, in 64 groups of 1024 (0…1023, 1024…2047, …, 63488…64511, 64512…65535). Level 1: the 64 group leaders (processors 0, 1024, …, 63488, 64512). Level 2: a single root coordinating the group leaders.]

Apply different strategies at each level (a small addressing helper follows below).
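For concreteness, a helper showing how the example tree above can be addressed; the constants come from the figure, while the function names are purely illustrative:

    #include <cstdio>

    // Addressing helpers for the example 64K-processor, two-level tree:
    // level-1 leaders sit at multiples of 1024, and a single root
    // coordinates the 64 group leaders.
    const int GROUP_SIZE = 1024;   // processors per level-1 group
    const int NUM_GROUPS = 64;     // level-1 groups under the root

    int level1Leader(int pe)    { return (pe / GROUP_SIZE) * GROUP_SIZE; }
    bool isLevel1Leader(int pe) { return pe % GROUP_SIZE == 0; }

    int main() {
      printf("%d\n", level1Leader(65000));     // prints 64512
      printf("%d\n", NUM_GROUPS * GROUP_SIZE); // prints 65536
      return 0;
    }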

Page 15: Our HybridLB Scheme

Our HybridLB Scheme

[Figure: the same 64K-processor tree. Load data (the object communication graph, OCG) flow up from the processors to their group leaders; refinement-based load balancing runs at the upper level and greedy-based load balancing within each group; decisions travel back down as lightweight tokens, and the objects themselves then migrate directly to their destinations.]

Page 16: Simulation Study - Memory Usage

Simulation Study - Memory Usage

[Chart: memory usage (MB) vs. number of objects (256K, 512K, 1M) for CentralLB and HybridLB; lb_test simulation of 64K processors.]

Simulation of the lb_test benchmark with the performance simulator.

Page 17: Total Load Balancing Time

Total Load Balancing Time

[Chart: time (s) vs. number of objects (256K, 512K, 1M) for GreedyCommLB and HybridLB(GreedyCommLB); lb_test simulation of 64K processors.]

Page 18: Load Balancing Quality

Load Balancing Quality

[Chart: maximum predicted load (seconds) vs. number of objects (256K, 512K, 1M) for GreedyCommLB and HybridLB; lb_test simulation of 64K processors.]

Page 19: Topology-Aware Mapping of Tasks

Topology-Aware Mapping of Tasks

Problem: map tasks to processors connected in a topology, such that:
- the compute load on the processors is balanced;
- communicating chares (objects) are placed on nearby processors.

Page 20: Mapping Model

Mapping Model

Task graph: $G_t = (V_t, E_t)$, a weighted graph with undirected edges. Nodes are chares, with $w(v_a)$ the computation of $v_a$; edges represent communication, with $c_{ab}$ the bytes exchanged between $v_a$ and $v_b$.

Topology graph: $G_p = (V_p, E_p)$. Nodes are processors; edges are direct network links. Examples: 3D torus, 2D mesh, hypercube.

Page 21: Model (Cont.)

Model (Cont.)

Task mapping: assigns tasks to processors, $P : V_t \to V_p$.

Hop-bytes as communication cost: the cost imposed on the network is higher when more links are used, so inter-processor communication is weighted by distance on the network:

$$HB = \sum_{e_{ab} \in E_t} c_{ab} \, d(P(v_a), P(v_b))$$

A sketch of this computation appears below.
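As a concrete illustration, a self-contained sketch of computing HB for a mapping onto a 3D torus; all types and names here are inventions of this sketch, not Charm++ API:

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    // One task-graph edge: c_ab bytes between tasks a and b.
    struct Edge  { int a, b; long bytes; };
    // Coordinates of a processor in the torus.
    struct Coord { int x, y, z; };

    // Shortest hop count along one wrap-around (ring) dimension of size n.
    int ringDist(int u, int v, int n) {
      int d = std::abs(u - v);
      return std::min(d, n - d);
    }

    // d(p, q): Manhattan distance on a 3D torus with extents dims.
    int torusDist(Coord p, Coord q, Coord dims) {
      return ringDist(p.x, q.x, dims.x) +
             ringDist(p.y, q.y, dims.y) +
             ringDist(p.z, q.z, dims.z);
    }

    // HB = sum over edges e_ab of c_ab * d(P(v_a), P(v_b)), where P[t]
    // holds the torus coordinates of the processor that task t maps to.
    long hopBytes(const std::vector<Edge>& edges,
                  const std::vector<Coord>& P, Coord dims) {
      long hb = 0;
      for (const Edge& e : edges)
        hb += e.bytes * torusDist(P[e.a], P[e.b], dims);
      return hb;
    }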

Page 22: Metric

Metric

Minimize hop-bytes, or equivalently hops-per-byte: the average number of hops traveled by a byte under a task mapping.
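Written out, this is a direct restatement of the definition above:

$$\text{HopsPerByte} = \frac{HB}{\sum_{e_{ab} \in E_t} c_{ab}}$$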

Page 23: TopoLB: Topology-Aware LB (Overview)

TopoLB: Topology-Aware LB (Overview)

First coalesce the task graph to n nodes (n = number of processors). MetisLB is used because it reduces inter-group communication; GreedyLB, GreedyCommLB, etc. can also be used.

Then repeat n times:
- pick a task t and a processor p;
- place t on p ($P(t) \leftarrow p$).

Tarun Agarwal, MS Thesis, 2005

Page 24: Picking t, p

Picking t, p

t is the task whose placement in this iteration is most critical; p is the processor where placing t costs least. The cost of placing t on p is approximated as:

$$cost(t, p) = \sum_{m \in \text{Assigned}} c_{tm} \, d(P(m), p) + \sum_{m \in \text{Unassigned}} c_{tm} \, d(\bar{p}, p)$$

where $\bar{p}$ estimates the (not yet fixed) processor of an unassigned task. Note that

$$HB(t) = \sum_{m \in V_t} c_{tm} \, d(P(m), P(t)), \qquad \text{HopBytes} = \sum_{t \in V_t} HB(t).$$

Page 25: Picking t, p (Cont.)

Picking t, p (Cont.)

Criticality of placing t in this iteration: by how much will the cost of placing t increase in the future? The future cost assumes t will be placed on some random free processor in a later iteration:

$$future\_cost(t) = \frac{1}{|\text{FreeProcs}|} \sum_{p_i \in \text{FreeProcs}} cost(t, p_i)$$

$$criticality(t) = future\_cost(t) - cost(t, p_{best})$$

Page 26: Putting It Together

Putting It Together
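Assembling the two preceding slides, a self-contained sketch of the TopoLB placement loop. The matrix-based data structures and helper names are assumptions of this sketch, and for brevity the unassigned-task term of cost(t, p) is dropped:

    #include <limits>
    #include <set>
    #include <vector>

    // c[t][m]: bytes between tasks t and m; dist[p][q]: hops between
    // processors p and q. The task graph is assumed to be already
    // coalesced, so the number of tasks equals the number of processors.
    using Matrix = std::vector<std::vector<double>>;

    // cost(t, p): communication of t with already-placed tasks, weighted
    // by network distance (the "assigned" term of the cost formula).
    double cost(int t, int p, const Matrix& c, const Matrix& dist,
                const std::vector<int>& placement) {
      double s = 0;
      for (size_t m = 0; m < placement.size(); ++m)
        if (placement[m] >= 0) s += c[t][m] * dist[placement[m]][p];
      return s;
    }

    // Returns placement[t] = the processor assigned to task t.
    std::vector<int> topoLB(const Matrix& c, const Matrix& dist) {
      const int n = c.size();
      std::vector<int> placement(n, -1);            // -1 = unplaced
      std::set<int> freeProcs, unplaced;
      for (int i = 0; i < n; ++i) { freeProcs.insert(i); unplaced.insert(i); }

      while (!unplaced.empty()) {
        int bestT = -1, bestP = -1;
        double bestCrit = -std::numeric_limits<double>::max();
        for (int t : unplaced) {
          // p_best: free processor where t is cheapest; future_cost:
          // average cost over free processors (random future placement).
          double cBest = std::numeric_limits<double>::max(), avg = 0;
          int pBest = -1;
          for (int p : freeProcs) {
            double cp = cost(t, p, c, dist, placement);
            avg += cp;
            if (cp < cBest) { cBest = cp; pBest = p; }
          }
          avg /= freeProcs.size();
          // criticality(t) = future_cost(t) - cost(t, p_best)
          if (avg - cBest > bestCrit) {
            bestCrit = avg - cBest; bestT = t; bestP = pBest;
          }
        }
        placement[bestT] = bestP;                   // P(t) <- p_best
        unplaced.erase(bestT);
        freeProcs.erase(bestP);
      }
      return placement;
    }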

Page 27: TopoCentLB: A Faster Topology-Aware LB

TopoCentLB: A Faster Topology-Aware LB

Coalesce the task graph to n nodes (n = number of processors). Picking task t and processor p:
- t is the task with maximum total communication with the already-assigned tasks;
- p is the processor where placing t costs least:

$$cost(t, p) = \sum_{m \in \text{Assigned}} c_{tm} \, d(P(m), p)$$
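For contrast, a sketch of TopoCentLB's simpler task-selection rule, reusing the illustrative Matrix data structure from the TopoLB sketch above:

    // TopoCentLB picks the unplaced task with maximum total communication
    // to the already-placed tasks (no look into the future); a processor
    // is then chosen by the same least-cost rule as in the TopoLB sketch.
    int pickTask(const std::set<int>& unplaced, const Matrix& c,
                 const std::vector<int>& placement) {
      int best = -1;
      double bestComm = -1;
      for (int t : unplaced) {
        double comm = 0;
        for (size_t m = 0; m < placement.size(); ++m)
          if (placement[m] >= 0) comm += c[t][m];
        if (comm > bestComm) { bestComm = comm; best = t; }
      }
      return best;
    }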

Page 28: TopoCentLB (Cont.)

TopoCentLB (Cont.)

Differences from TopoLB:
- no notion of criticality as in TopoLB;
- considers only the past mapping; doesn't look into the future.

Running complexity:
- TopoLB: depends on the criticality function, O(p|E_t|) or O(p^3);
- TopoCentLB: O(p|E_t|), with a smaller constant than TopoLB.

Page 29: Results

Results

We compare TopoLB, TopoCentLB, and random placement.

Charm++ LB simulation mode (2D-Jacobi-like benchmark, LeanMD): reduction in hop-bytes.

BlueGene/L (2D-Jacobi-like benchmark): reduction in running time.

Page 30: Simulation Results

Simulation Results: 2D-mesh pattern on a 3D-torus topology (same size)

Page 31: Simulation Results

Simulation Results: LeanMD on a 3D torus

Page 32: Experimental Results on BlueGene

Experimental Results on BlueGene: 2D-mesh pattern on a 3D torus (message size: 100KB)

Page 33: Experimental Results on BlueGene

Experimental Results on BlueGene: 2D-mesh pattern on a 3D mesh (message size: 100KB)

Page 34: Conclusions

Conclusions

- Large machines like BG/L create a need for scalable load balancers.
- Hybrid load balancers take a distributed approach that keeps communication localized within a neighborhood.
- Efficient topology-aware task-mapping strategies that reduce hop-bytes also lead to lower network latencies and better tolerance of contention and bandwidth constraints.