harigovind, gengbin zheng, laxmikantkale, michael ...charm.cs.uiuc.edu/posters/loadbalancing.pdf ·...

2
Hari Govind, Gengbin Zheng, Laxmikant Kale, Michael Breitenfeld, Philippe Geubelle University of Illinois at Urbana-Champaign Charm++ Architecture User View: object carrying the job System implementation: objects mapped to real processors Virtual Objects Real processors Software engineering Number of virtual processors can be independently controlled Separate VPs for different modules Message driven execution Computation performed upon receipt of a message Adaptive overlap of communication Predictability : Automatic out-of-core execution Asynchronous reductions Dynamic mapping Heterogeneous clusters Vacate, adjust to speed, share Automatic checkpointing/restarting Automatic dynamic load balancing Change set of processors used Communication optimization Benefits AMPI: Adaptive MPI Each virtual process implemented as a user-level thread embedded in a Charm++ object Real Processors MPI processesImplemented as virtual processes (user-level migratable threads) Load balancing task in Charm++ Given a collection of migratable objects and a set of computers connected in a certain topology Find a mapping of objects to processors Almost same amount of computation on each processor Communication between processors is minimum Dynamic mapping of objects to processors Two major approaches No predictability of load patterns Fully dynamic Early work on State Space Search, Branch&Bound, ... Seed load balancers With certain predictability CSE, molecular dynamics simulation Measurement-based load balancing strategy http://charm.cs.uiuc.edu http://charm.cs.uiuc.edu Motivations Versatile, automatic load balancers are desired Application independent No/little user effort is needed to balance load Addresses the load balancing needs of many different types of applications Tasks are initially represented by object creation messages, or ``seeds''. Seed load balancing involves the movement of seeds, to balance work across processors Low responsiveness Load balancing request blocked by long entries Neighborhood averaging with work-stealing when Idle using immediate messages Interruption-based message Fast response to the request Work-stealing at idle time 80000 objects, 10% heavy objects Principle of Persistence Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time In spite of dynamic behavior Abrupt and large,but infrequent changes (eg:AMR) Slow and small changes (eg: particle migration) Parallel analog of principle of locality Heuristics, that holds for most CSE applications How to Migrate Objects Objects Packing/unpacking functions User-level Threads Global variables: ELF object format: switch GoT pointer Alternative: compiler/pre-processor support Migration of stack Isomalloc (from PM2 in France): Reserve virtual space on all processors for each thread Mmap it when you migrate there Migration of Heap data: Isomalloc heaps User-supplied or compiler generated pack function Measurement Based Load Balancing Based on Principle of persistence Runtime instrumentation Measures communication volume and computation time Measurement based load balancers Use the instrumented data-base periodically to make new decisions Many alternative strategies can use the database Centralized vs. distributed Greedy improvements vs. complete reassignments Taking communication into account Taking dependences into account (More complex) Topology-aware Sequential Refinement and Coarsening Results Shock propagation and reflection down the length of the bar Adaptive mesh modification to capture the shock propagation Applications CSE applications Crack propagation Adaptive mesh refinement Molecular dynamics NAMD CPAIMD Cosmology simulation Fault tolerance Independent mesh adaptivity operations in parallel Locking individual nodes No global synchronization Adjacency data structures: Node-to-node Element-to-element Node-to-element Mesh Adjacency e2e,e2n,n2e,n2n Generate, Modify Mesh Modification Lock(),Unlock() Add/Remove Node() Add/Remove Element() Nodes: Local, Shared, Ghost Elements: Local,Ghost Application (serial or parallel) Mesh Adaptivity Edge Flip, Edge Bisect, Edge Contract, Longest edge bisect, ParFUM Integrated: Geometry as well as ghost layer management Parallel Framework for Unstructured Meshes (ParFUM) Charm++ Load Balancing Framework Seed Load Balancing Adaptivity code integrated with ParFUM High-level algorithms for refinement and coarsening n Built on low- level parallel mesh modification primitives n Maintain up-to- date parallel mesh state including adjacencies, ghost layers and user data Programmer: [Over] decomposition into virtual processors (VP) Runtime: Assigns VPs to processors Enables adaptive runtime strategies Processor Virtualization Parallel machines abound Capabilities enhanced as machines get more powerful PSC Lemieux, ASCI White, Earth Simulator, BG/L Clusters becoming ubiquitous Desktops and Games consoles go parallel: Cell processor, multi-core chips, Applications get more ambitious and complex Adaptive algorithms Irregular or dynamic behavior Multi-component and multi-physics MPI based code limitations No adaptive load balancing Thread 2 stack Thread 4 stack Processor As Memory Code Globals Heap 0x00000000 0xFFFFFFFF Thread 1 stack Code Globals Heap 0x00000000 0xFFFFFFFF Processor Bs Memory Migrate Thread 3 Thread 3 stack Thread migration with isomalloc Load Balancing Strategies

Upload: others

Post on 08-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: HariGovind, Gengbin Zheng, LaxmikantKale, Michael ...charm.cs.uiuc.edu/posters/LoadBalancing.pdf · as ghost layer management Parallel Framework for Unstructured Meshes (ParFUM) Charm++

Hari Govind, Gengbin Zheng, Laxmikant Kale, Michael Breitenfeld, Philippe GeubelleUniversity of Illinois at Urbana-Champaign

Charm++ Architecture

User View:object carrying the job

System implementation:objects mapped to real processors

Virtual Objects

Real processors

Software engineeringNumber of virtual processors can be independently controlledSeparate VPs for different modules

Message driven executionComputation performed upon receipt of a messageAdaptive overlap of communicationPredictability :

Automatic out-of-core executionAsynchronous reductions

Dynamic mappingHeterogeneous clusters

Vacate, adjust to speed, shareAutomatic checkpointing/restartingAutomatic dynamic load balancingChange set of processors usedCommunication optimization

Benefits

AMPI: Adaptive MPIEach virtual process implemented as a user-level threadembedded in a Charm++ object

Real Processors

MPI “processes”

Implemented as virtual processes (user-level migratable threads)

Load balancing task in Charm++Given a collection of migratable objects and a set of computers connected in a certain topologyFind a mapping of objects to processors

Almost same amount of computation on each processorCommunication between processors is minimum

Dynamic mapping of objects to processorsTwo major approaches

No predictability of load patternsFully dynamic Early work on State Space Search, Branch&Bound, ...Seed load balancers

With certain predictabilityCSE, molecular dynamics simulationMeasurement-based load balancing strategy

http://charm.cs.uiuc.eduhttp://charm.cs.uiuc.edu

Motivations

Versatile, automatic load balancers are desiredApplication independentNo/little user effort is needed to balance loadAddresses the load balancing needs of many different types of applications

Tasks are initially represented by object creation messages, or ``seeds''.Seed load balancing involves the movement of seeds, to balance work across processorsLow responsiveness

Load balancing request blocked by long entries

Neighborhood averaging with work-stealing when Idle using immediate messages

Interruption-based messageFast response to the requestWork-stealing at idle time

80000 objects, 10% heavy objects

Principle of Persistence

Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time

In spite of dynamic behaviorAbrupt and large,but infrequent changes (eg:AMR)Slow and small changes (eg: particle migration)

Parallel analog of principle of localityHeuristics, that holds for most CSE applications

How to Migrate ObjectsObjects

Packing/unpacking functionsUser-level Threads

Global variables:ELF object format: switch GoT pointerAlternative: compiler/pre-processor support

Migration of stackIsomalloc (from PM2 in France):

Reserve virtual space on all processors for each threadMmap it when you migrate there

Migration of Heap data:Isomalloc heapsUser-supplied or compiler generated pack function

Measurement Based Load Balancing

Based on Principle of persistenceRuntime instrumentation

Measures communication volume and computation time

Measurement based load balancersUse the instrumented data-base periodically to make new decisionsMany alternative strategies can use the database

Centralized vs. distributedGreedy improvements vs. complete reassignmentsTaking communication into accountTaking dependences into account (More complex)Topology-aware

Sequential Refinement and Coarsening Results

Shock propagation and reflection down the length of the bar

Adaptive mesh modification to capture the shock propagation

ApplicationsCSE applications

Crack propagationAdaptive mesh refinement

Molecular dynamicsNAMDCPAIMD

Cosmology simulationFault tolerance

● Independent mesh adaptivity operations in parallel● Locking individual nodes● No global synchronization● Adjacency data structures:

● Node-to-node● Element-to-element● Node-to-element

Mesh Adjacencye2e,e2n,n2e,n2n

Generate, Modify …

Mesh ModificationLock(),Unlock()

Add/Remove Node()Add/Remove Element()

Nodes:Local, Shared, Ghost

Elements:Local,Ghost

Application (serial or parallel)

Mesh AdaptivityEdge Flip, Edge Bisect, Edge Contract, Longest

edge bisect, …

ParFUM

Integrated: Geometry as well as ghost layer management

Parallel Framework for Unstructured Meshes (ParFUM)

Charm++ Load Balancing Framework

Seed Load Balancing

Adaptivity code integrated with ParFUMHigh-level algorithms for refinement and coarsening

n Built on low-level parallel mesh modification primitives

n Maintain up-to-date parallel mesh state including adjacencies, ghost layers and user data

Programmer: [Over] decomposition into virtual processors (VP)

Runtime: Assigns VPs to processors

Enables adaptive runtime strategies

Processor Virtualization

Parallel machines aboundCapabilities enhanced as machines get more powerful

PSC Lemieux, ASCI White, Earth Simulator, BG/LClusters becoming ubiquitous Desktops and Games consoles go parallel:

Cell processor, multi-core chips, Applications get more ambitious and complex

Adaptive algorithmsIrregular or dynamic behaviorMulti-component and multi-physicsMPI based code limitations

No adaptive load balancing

Thread 2 stack

Thread 4 stack

Processor A’s Memory

CodeGlobals

Heap

0x00000000

0xFFFFFFFF

Thread 1 stack

CodeGlobals

Heap

0x00000000

0xFFFFFFFF

Processor B’s Memory

Migrate Thread

3

Thread 3 stack

Thread migration with isomalloc

Load Balancing Strategies

Page 2: HariGovind, Gengbin Zheng, LaxmikantKale, Michael ...charm.cs.uiuc.edu/posters/LoadBalancing.pdf · as ghost layer management Parallel Framework for Unstructured Meshes (ParFUM) Charm++

Apply adaptive load balancing framework for increasingly complex simulationsAdaptive insertion/activation of cohesive elements for dynamic fracture simulationsAdaptive mesh adaptation

Conduct experiments using the load balancing framework on very large parallel machines such as Blue Gene/L

Requires mesh to be partitioned into very large number of chunksExperiment with the hierarchical load balancing strategy

Future work

http://charm.cs.uiuc.eduhttp://charm.cs.uiuc.edu

1-D elastic-plastic wave propagation

Bar is dynamically loaded resulting in an elastic wave propagating down bar, upon reflection from the fixed end the material becomes plastic

Written in AMPI

3-D dynamic elastic-plastic fracture

3D Plastic FractureA single edge notched specimen pulled at both ends with a ramping magnitude of 1 m/s over .01 secondsIsosurface is the extent

of the plastic zone

Load imbalance occurs at the onset of an element turning from elastic to plastic, zone of plasticity forms over a limited number of processors as the crack propagates

0.01

0.1

1

10

100

1 2 4 8 16 32 64 128 256 512 1024

time

per

step

(sec

onds

)

SC2002 Gordon Bell Award

36 ms per step76% efficiency

327K atomswith PME

Lemieux(PSC)

28 s per step

Linear scaling

Processor Utilization against Time on (a) 128 (b) 1024 processors

On 128 processor, a single load balancing step suffices, but

On 1024 processors, we need a “refinement” step.

Load Balancing

Aggressive Load Balancing

Refinement Load

Balancing

Processor Utilization across processors after (a) greedy load balancing and (b) refining

Note that the underloaded processors are left underloaded (as they don’t impact perforamnce); refinement deals only with the overloaded ones

Some overloaded processors

Hari Govind, Gengbin Zheng, Laxmikant Kale, Michael Breitenfeld, Philippe GeubelleUniversity of Illinois at Urbana-Champaign

Profile view of a 3000 processor run of NAMD (White shows idle time)

With LB

0

1

2

3

4

1 101 201 301 401 501 601

Timestep

Simulation time per

step (s)

Without LB

0

1

2

3

4

1 101 201 301 401 501 601

Timestep

Simulation time per

step (s)

LeanMD, Apoa1, 128 processors

Crack Propagation Simulation

•Dynamic 3D crack propagation simulation•400,000 linear strain tetrahedral elements•SGi Altix (NCSA)•32 processors•160 AMPI threads•Simulating an elastic bar•Total run time: 207 seconds

Total execution time: 198 seconds 187 seconds

Load balancing Load balancing

With “stop and go” load balancing scheme With agile load balancing scheme

Molecular Dynamics Simulation Load Balancing on Very Large MachinesScalability limits

Consider an application with 1M objects on 64K processors

Metrics for a multi-dimensional optimizationMemory usage on any one processorDecision-making timeQuality of load balancing decision

0

50

100

150

200

250

300

350

400

450

500

LB Memoryusage on thecentral node

(MB)

128K 256K 512K 1M

Number of objects

32K processors 64K processors

benchmark creates a specified number of communicating objects in 2D-mesh.Run on Lemieux 64 processors, using BigSim

0

50

100

150

200

250

300

350

400

ExecutionTime (inseconds)

128K 256K 512K 1M

Number of Objects

GreedyLB GreedyCommLB RefineLB

Simulation Results : using BigSim

Load Balancing in Fault Tolerance

LeanMD application10 crashes128 processorsCheckpoint every 10

time steps

0 … 1023 6553564512 …1024 … 2047 6451163488 ……...

0 1024 63488 64512

1

Load Data (OCG)

Refinement-based Load balancing

Greedy-based Load balancing

Load Data

tokenobject

Hierarchical Load Balancing

Molecular dynamics and related algorithmse.g., minimization, steering, locally enhanced sampling, alchemical and conformational free energy perturbation

Efficient algorithms for full electrostaticsEffective on affordable commodity hardwareBuilding a complete modeling environmentWritten in Charm++

ATP-Synthase

Double in-memory checkpoint/restartDoes not rely on extra processors

Maintain execution efficiency after restart