basic charm++ and load balancing

1

Basic Charm++ and Load Balancing

Gengbin Zheng

charm.cs.uiuc.edu 10/11/2005

2

Charm++ Basics

3

Charm++

Parallel library for object-oriented C++ applications
Invoke functions remotely
Messaging via remote method calls (like CORBA)
• Communication “proxy” objects
Methods called by scheduler
• System determines who runs next
Multiple objects per processor
Object migration fully supported
• Even with broadcasts, reductions

4

Virtualized Programming Model

User View

System implementation

User writes code in terms of communicating objects

System maps objects to processors

5

Chares – Concurrent Objects

Can be dynamically created on any available processor

Can be accessed from remote processors

Send messages to each other asynchronously

Contain “entry methods”

6

Charm++ Features: Object Arrays

A[0] A[1] A[2] A[3] A[n]

User’s view

Applications are written as a set of communicating objects

7


Charm++ maps those objects onto processors, routing messages as needed

[Figure: User’s view: A[0] A[1] A[2] A[3] ... A[n]. System view: elements such as A[3] and A[0] placed on a processor.]

8


Charm++ can re-map (migrate) objects for communication, load balance, fault tolerance, etc.

[Figure: User’s view: A[0] A[1] A[2] A[3] ... A[n]. System view: elements such as A[3] and A[0] placed on a processor.]

9

Charm++ Array Definition

Interface (.ci) file:

array [1D] foo {
  entry foo(int problemNo);
  entry void bar(int x);
};

In a .C file:

class foo : public CBase_foo {
public:
  // Remote calls
  foo(int problemNo) { ... }
  void bar(int x) { ... }
  // Migration support:
  foo(CkMigrateMessage *m) {}
  void pup(PUP::er &p) { ... }
};

10

Charm++ Remote Method Calls

To call a method on a remote C++ object foo, use the local “proxy” C++ object CProxy_foo generated from the interface file:

array [1D] foo {
  entry foo(int problemNo);
  entry void bar(int x);
};

In a .C file:

CProxy_foo someFoo = ...;   // CProxy_foo: generated class
someFoo[i].bar(17);         // i'th object, then method and parameters

This results in a network message, and eventually in a call to the real object’s method.

In another .C file:

void foo::bar(int x) {
  ...
}

11

Charm++ Startup Process: Main

Interface (.ci) file:

module myModule {
  array [1D] foo {
    entry foo(int problemNo);
    entry void bar(int x);
  };
  mainchare myMain {                   // special startup object
    entry myMain(int argc, char **argv);
  };
};

In a .C file:

#include "myModule.decl.h"
class myMain : public CBase_myMain {   // CBase_myMain: generated class
public:
  myMain(int argc, char **argv) {      // called at startup on PE 0
    int nElements = 7, i = nElements / 2;
    CProxy_foo f = CProxy_foo::ckNew(2, nElements);
    f[i].bar(3);
  }
};
#include "myModule.def.h"

12

“Hello World!”

.ci file:

mainmodule hello {
  mainchare mymain {
    entry mymain(CkArgMsg *m);
  };
};

Generates hello.decl.h and hello.def.h

.C file:

#include "hello.decl.h"
class mymain : public CBase_mymain {
public:
  mymain(CkArgMsg *m) {
    ckout << "Hello World" << endl;
    CkExit();
  }
};
#include "hello.def.h"

13

Compile and run the program

Compiling
• charmc <options> <source file>
• -o, -g, -language, -module, -tracemode

pgm: pgm.ci pgm.h pgm.C
	charmc pgm.ci
	charmc pgm.C
	charmc -o pgm pgm.o -language charm++

To run a Charm++ program named "pgm" on four processors, type:

charmrun pgm +p4 <params>

Nodelist file (for network architecture)
• list of machines to run the program
• host <hostname> <qualifiers>

Example nodelist file:

group main ++shell ssh
host Host1
host Host2

14

Charm++: Portability

Runs on:
Any machine with MPI, including
• IBM SP, Blue Gene/L
• Cray XT3
• Origin2000
PSC’s Lemieux (Quadrics Elan)
Clusters with Ethernet (UDP/TCP)
Clusters with Myrinet (GM)
Clusters with Ammasso cards
Apple clusters
Even Windows!

SMP-aware (pthreads)

15

Build Charm++

Download from website
• http://charm.cs.uiuc.edu/download.html
Build Charm++
• ./build <target> <version> <options> [compile flags]
• ./build charm++ net-linux gm -g
• Parallel make (-j2)
Compile code using charmc
• Portable compiler wrapper
• Link with “-language charm++”
Run code using charmrun

16

How Does Charmrun Work?

charmrun +p4 ./pgm

[Figure: charmrun launch sequence, with arrows labeled ssh, connect, and acknowledge]

17

Charmrun (batch mode)

charmrun ++batch 8

[Figure: charmrun launch sequence in batch mode, with arrows labeled ssh, connect, and acknowledge]

18

Debugging Charm++ Applications

printf
gdb
• Sequentially (standalone mode)
  gdb ./pgm +vp16
• Run debugger in xterm
  charmrun +p4 pgm ++debug
  charmrun +p4 pgm ++debug-no-pause
Memory paranoid
Parallel debugger

19

Charm++ Features

20

Message Driven Execution

[Figure: two processors, each with its own scheduler and message queue]

Virtualization leads to Message Driven Execution
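The idea can be sketched in plain C++ without any Charm++ API. This is only an illustration under stated assumptions: the names Counter, Message, Scheduler, and demoMessageDriven are hypothetical, and one Scheduler stands in for the per-processor scheduler and message queue. Objects never block waiting for each other; they only run when the scheduler delivers a message to them.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for a chare: it only reacts to incoming messages.
struct Counter {
    int total = 0;
    void add(int x) { total += x; }
};

// A "message" here is simply a deferred method call on some object.
using Message = std::function<void()>;

// One scheduler per (simulated) processor, owning one message queue.
class Scheduler {
public:
    void enqueue(Message m) { queue_.push_back(std::move(m)); }

    // Deliver messages until the queue is empty; returns how many were delivered.
    std::size_t run() {
        std::size_t delivered = 0;
        while (!queue_.empty()) {
            Message m = std::move(queue_.front());
            queue_.pop_front();
            m();  // the invoked method may enqueue further messages
            ++delivered;
        }
        return delivered;
    }

private:
    std::deque<Message> queue_;
};

// Two objects share one processor; the scheduler decides who runs next.
inline int demoMessageDriven() {
    Scheduler sched;
    Counter a, b;
    sched.enqueue([&] { a.add(1); sched.enqueue([&] { b.add(10); }); });
    sched.enqueue([&] { a.add(2); });
    sched.run();
    return a.total + b.total;
}
```

Because several objects share each queue, a processor always has other work to pick up while one object is waiting for data.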

21

Prioritized Messages

Number of priority bits passed during message allocation
  FooMsg *msg = new (size, nbits) FooMsg;
Priorities stored at the end of messages
Signed integer priorities:
  *CkPriorityPtr(msg) = -1;
  CkSetQueueing(msg, CK_QUEUEING_IFIFO);
Unsigned bitvector priorities:
  CkPriorityPtr(msg)[0] = 0x7fffffff;
  CkSetQueueing(msg, CK_QUEUEING_BFIFO);
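What integer-prioritized FIFO queueing means for delivery order can be sketched with a standard priority queue. This is a plain C++ illustration, not the Charm++ queue implementation; PrioMsg, PrioQueue, and demoPriorities are hypothetical names, and the sketch assumes that numerically smaller priorities are served first and that equal priorities are served in arrival order.

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct PrioMsg {
    int priority;       // assumption: smaller value is delivered earlier
    std::uint64_t seq;  // arrival order, keeps FIFO order among equal priorities
    std::string payload;
};

struct DeliveredLater {
    // Returns true when a should be delivered after b.
    bool operator()(const PrioMsg& a, const PrioMsg& b) const {
        if (a.priority != b.priority) return a.priority > b.priority;
        return a.seq > b.seq;
    }
};

class PrioQueue {
public:
    void push(int priority, std::string payload) {
        q_.push(PrioMsg{priority, next_++, std::move(payload)});
    }
    bool empty() const { return q_.empty(); }
    std::string pop() {
        std::string p = q_.top().payload;
        q_.pop();
        return p;
    }

private:
    std::priority_queue<PrioMsg, std::vector<PrioMsg>, DeliveredLater> q_;
    std::uint64_t next_ = 0;
};

// Enqueue four messages and report the order in which they come out.
inline std::vector<std::string> demoPriorities() {
    PrioQueue q;
    q.push(0, "normal-1");
    q.push(-1, "urgent");
    q.push(0, "normal-2");
    q.push(5, "background");
    std::vector<std::string> order;
    while (!q.empty()) order.push_back(q.pop());
    return order;
}
```

The sequence counter is what turns a plain priority queue into a FIFO-within-priority queue; without it, messages of equal priority could be reordered.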

22

Advanced Message Features

Expedited messages
• Messages do not go through the Charm++ scheduler (faster)
• Top priority messages
Immediate messages
• Entries are executed in an interrupt or the communication thread
• Very fast, but tough to get right

23

Object Migration

24

How to Migrate a Virtual Processor?

Move all application state to new processor

Stack Data (threads)
• Subroutine variables and calls
• Managed by compiler
Heap Data
• Allocated with malloc/free
• Managed by user
Global Variables
Open files, environment variables, etc. (not handled yet!)

25

Migration Solutions

Stack Data (threads)
• Automatic: isomalloc stacks
Heap Data
• Use “-memory isomalloc” -or-
• Write pup routines
Global Variables
• Use “-swapglobals”
  • Works on ELF platform (Linux and Sun)
  • Just a pointer swap, no data copying
• -or- Remove globals entirely

26

Migrate Heap Data: PUP

Packing/unpacking user allocated data

Basic contract: here is my data
• Sizing: counts up data size
• Packing: copies data into message
• Unpacking: copies data back out
Same call works for network, memory, disk I/O ...
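The sizing, packing, and unpacking contract can be illustrated with a small stand-alone mock. This is a sketch only and does not use the real PUP::er class: MiniPup, Mesh, and roundTrip are hypothetical names, and the buffer is a plain byte array standing in for a message. The point is that one pup routine serves all three passes.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical mini "pupper": one interface, three modes.
class MiniPup {
public:
    enum Mode { Sizing, Packing, Unpacking };

    static MiniPup sizer() { return MiniPup(Sizing, nullptr); }
    static MiniPup packer(char* buf) { return MiniPup(Packing, buf); }
    static MiniPup unpacker(char* buf) { return MiniPup(Unpacking, buf); }

    bool isUnpacking() const { return mode_ == Unpacking; }
    std::size_t size() const { return offset_; }

    // Visit n raw bytes: count them, copy them out, or copy them back in.
    void bytes(void* data, std::size_t n) {
        if (mode_ == Packing) std::memcpy(buf_ + offset_, data, n);
        else if (mode_ == Unpacking) std::memcpy(data, buf_ + offset_, n);
        offset_ += n;
    }

private:
    MiniPup(Mode m, char* buf) : mode_(m), buf_(buf) {}
    Mode mode_;
    char* buf_;
    std::size_t offset_ = 0;
};

struct Mesh {
    std::vector<float> nodes;

    // The single routine used for sizing, packing, and unpacking.
    void pup(MiniPup& p) {
        std::size_t n = nodes.size();
        p.bytes(&n, sizeof n);                 // when unpacking, n is read back here
        if (p.isUnpacking()) nodes.resize(n);  // allocate before filling
        if (n > 0) p.bytes(nodes.data(), n * sizeof(float));
    }
};

// Round trip: size, pack into a buffer, unpack into a fresh object.
inline Mesh roundTrip(Mesh& src) {
    MiniPup s = MiniPup::sizer();
    src.pup(s);
    std::vector<char> buf(s.size());
    MiniPup pk = MiniPup::packer(buf.data());
    src.pup(pk);
    Mesh dst;
    MiniPup up = MiniPup::unpacker(buf.data());
    dst.pup(up);
    return dst;
}
```

Writing the element count before the data is what lets the unpacking pass allocate the right amount of storage before copying.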

27

Migrate Heap Data: PUP C++ Example

#include "pup.h"
#include "pup_stl.h"

class myMesh {
  std::vector<float> nodes;
  std::vector<int> elts;
public:
  ...
  void pup(PUP::er &p) {
    p|nodes;
    p|elts;
  }
};

28

Migrate Heap Data: PUP F90 Example

TYPE myMesh
  INTEGER :: nn, ne
  REAL*4, ALLOCATABLE, DIMENSION(:) :: nodes
  INTEGER, ALLOCATABLE, DIMENSION(:) :: elts
END TYPE

SUBROUTINE pupMesh(p, mesh)
  USE MODULE ...
  INTEGER :: p
  TYPE(myMesh) :: mesh
  CALL fpup_int(p, mesh%nn)
  CALL fpup_int(p, mesh%ne)
  IF (fpup_isUnpacking(p)) THEN
    ALLOCATE(mesh%nodes(mesh%nn))
    ALLOCATE(mesh%elts(mesh%ne))
  END IF
  CALL fpup_floats(p, mesh%nodes, mesh%nn)
  CALL fpup_ints(p, mesh%elts, mesh%ne)
  IF (fpup_isDeleting(p)) CALL deleteMesh(mesh)
END SUBROUTINE

29

Automatic Load Balancing

30

Motivation

Irregular or dynamic applications
• Initial static load balancing
• Application behaviors change dynamically
• Difficult to implement with good parallel efficiency
Versatile, automatic load balancers
• Application independent
• No/little user effort is needed in load balance
• Work for both Charm++ and Adaptive MPI

31

Using Dynamic Mapping to Processors

Migrate objects between processors
Use that for dynamic (and static, initial) load balancing
Two major approaches
• No predictability of load patterns
  • Fully dynamic
  • Early work on State Space Search, Branch&Bound, ..
• With certain predictability
  • Measurement-based load balancing strategy
  • CSE, molecular dynamics simulation

32

Applications Lacking Predictability

Flow of tasks: the application generates a continuous flow of tasks
The goal of the load balancing strategies is to balance these tasks across the system for a fast response time and better throughput
Tasks are assigned at creation time, with no migration afterwards

33

Seed Load Balancing

Neighborhood averaging with work-stealing when idle, using immediate messages
Load balancing among neighboring processors
• Load is represented by length of queue
Work-stealing at idle time with interruption-based message
• Fast response to the request

[Figure: 80000 objects, 10% heavy objects]

34

Link with a seed load balancer

Use -balance <random|neighbor>
• charmc -o pgm pgm.o -balance neighbor
Specify topology
• +LBTopo <ring|torus2d|…>

35

Principle of Persistence

Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time
In spite of dynamic behavior
• Abrupt and large, but infrequent changes (e.g., AMR)
• Slow and small changes (e.g., particle migration)
Parallel analog of principle of locality
• A heuristic that holds for most CSE applications
Run-time instrumentation is possible

36

Measurement Based Load Balancing

Runtime instrumentation
• Measures CPU load per object
• Measures communication volume between objects
Measurement based load balancers
• Use the instrumented database periodically to make new decisions
• A load balancing strategy takes the database as input and generates a new object-to-processor mapping
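The "database in, mapping out" shape of a strategy can be sketched as an ordinary function. The example below is a generic greedy heuristic in plain C++, not the code of any Charm++ load balancer: greedyMap and maxProcLoad are hypothetical names, the database is reduced to one measured CPU load per object, and communication volume is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Input: measured load per object. Output: mapping[i] = processor for object i.
inline std::vector<int> greedyMap(const std::vector<double>& objLoad, int numProcs) {
    assert(numProcs > 0);
    std::vector<std::size_t> order(objLoad.size());
    std::iota(order.begin(), order.end(), 0);
    // Heaviest objects first; stable sort keeps ties in index order.
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return objLoad[a] > objLoad[b]; });

    // Min-heap of (current load, processor id).
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
    for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

    std::vector<int> mapping(objLoad.size(), -1);
    for (std::size_t obj : order) {
        Entry least = procs.top();  // currently least loaded processor
        procs.pop();
        mapping[obj] = least.second;
        least.first += objLoad[obj];
        procs.push(least);
    }
    return mapping;
}

// Helper: the heaviest processor load under a given mapping.
inline double maxProcLoad(const std::vector<double>& objLoad,
                          const std::vector<int>& mapping, int numProcs) {
    std::vector<double> load(numProcs, 0.0);
    for (std::size_t i = 0; i < objLoad.size(); ++i) load[mapping[i]] += objLoad[i];
    return *std::max_element(load.begin(), load.end());
}
```

With object loads {5, 3, 3, 2, 2, 1} on two processors, this sketch places 8 units of work on each. A real strategy would also weigh the measured communication between objects.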

37

Load Balancing – graph partitioning

[Figure: LB view: weighted object graph as seen by the load balancer, and the mapping of objects to Charm++ PEs]

38

Charm++ Load Balancer in Action

Automatic Load Balancing in Crack Propagation

39

Load Balancer Categories

Centralized
• Object load data are sent to processor 0
• Integrate to a complete object graph
• Migration decision is broadcast from processor 0
• Global barrier

Distributed
• Load balancing among neighboring processors
• Build partial object graph
• Migration decision is sent to its neighbors
• No global barrier

40

Main Centralized Load Balancing Strategies

GreedyCommLB
• A “greedy” load balancing strategy which uses the process load and communications graph to map the processes with the highest load onto the processors with the lowest load, while trying to keep communicating processes on the same processor
RefineLB
• Incremental adjustment by moving objects off overloaded processors to under-utilized processors to reach average load
MetisLB
• Uses the METIS graph partitioning library to partition the object-communication graph with node (object) weights and communication loads on edges
OrbLB
• Treats objects with spatial coordinates. It applies an orthogonal recursive bisection algorithm which attempts to provide a more balanced division of space.
Others
• The manual discusses several other load balancers which are not used as often, but may be useful in some cases; also, more are being developed
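The incremental idea described for RefineLB can be sketched as follows. This is a simplified illustration in plain C++, not the RefineLB source: the function name refine, the tolerance parameter, and the rule for picking which object to move are all assumptions made for the example. Starting from the existing mapping, it moves objects off the most loaded processor onto the least loaded one until no processor exceeds tolerance times the average load.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Updates mapping in place and returns the number of migrations performed.
inline int refine(const std::vector<double>& objLoad, std::vector<int>& mapping,
                  int numProcs, double tolerance = 1.05) {
    assert(numProcs > 0);
    std::vector<double> load(numProcs, 0.0);
    for (std::size_t i = 0; i < objLoad.size(); ++i) load[mapping[i]] += objLoad[i];
    const double limit =
        tolerance * std::accumulate(load.begin(), load.end(), 0.0) / numProcs;

    int migrations = 0;
    for (;;) {
        int heavy = int(std::max_element(load.begin(), load.end()) - load.begin());
        int light = int(std::min_element(load.begin(), load.end()) - load.begin());
        if (load[heavy] <= limit) break;  // nobody is overloaded any more

        // Largest object on the heavy processor that still fits on the light one.
        int best = -1;
        for (std::size_t i = 0; i < objLoad.size(); ++i) {
            if (mapping[i] != heavy || objLoad[i] <= 0.0) continue;
            if (load[light] + objLoad[i] > limit) continue;
            if (best < 0 || objLoad[i] > objLoad[best]) best = int(i);
        }
        if (best < 0) break;  // no move helps without overloading the receiver

        mapping[best] = light;
        load[heavy] -= objLoad[best];
        load[light] += objLoad[best];
        ++migrations;
    }
    return migrations;
}
```

For object loads {4, 4, 1, 1} initially mapped as {0, 0, 0, 1}, a single migration is enough to bring both processors to a load of 5. Keeping most objects where they are is the main attraction of a refinement strategy compared with remapping from scratch.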

41

Load Balancing Strategies

[Figure: diagram of load balancing strategy classes, starting from BaseLB. Classes shown: CentralLB, NborBaseLB, OrbLB, DummyLB, MetisLB, RecBisectBfLB, GreedyLB, RandCentLB, RefineLB, GreedyCommLB, RandRefLB, RefineCommLB, NeighborLB, GreedyRefLB]

42

Neighborhood Load Balancing Strategies

NeighborLB
• Each processor tries to average out its load only among its neighbors
WSLB
• A load balancer for timeshared workstation clusters, which can detect load changes on desktops and adjust load without interfering with others' use of the desktop
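Averaging load only among neighbors can be illustrated with one sweep on a ring of processors. This is a toy model in plain C++, not NeighborLB itself: ringAverageStep is a hypothetical name, the ring topology is an assumption, and load is treated as freely divisible, whereas a real balancer has to move whole objects.

```cpp
#include <cstddef>
#include <vector>

// One averaging sweep on a ring: every processor ends up with the mean of
// its own load and its two neighbors' loads. The total load is unchanged.
inline std::vector<double> ringAverageStep(const std::vector<double>& load) {
    const std::size_t n = load.size();
    std::vector<double> next(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double left = load[(i + n - 1) % n];
        const double right = load[(i + 1) % n];
        next[i] = (left + load[i] + right) / 3.0;
    }
    return next;
}
```

Starting from loads {9, 0, 0, 0, 0, 0}, one sweep gives {3, 3, 0, 0, 0, 3}: the hot spot spreads to its neighbors using only local information, and repeated sweeps spread it further.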

43

Compiler Interface

Link time options
-module: Link load balancers as modules
• -module EveryLB
Link multiple modules into binary
• -balancer GreedyCommLB -balancer RefineLB
• -balancer ComboCentLB:GreedyLB,RefineLB

44

Runtime Options

Run-time options do the same thing, but override the compile time options
+balancer: invoke a load balancer
Can have multiple load balancers
• +balancer GreedyCommLB +balancer RefineLB

45

When to Re-balance Load?

Programmer Control: ReadyLoadBalance()
• Enable load balancing at specific point
• Object ready to migrate
• Re-balance if needed
• ReadyLoadBalance() called when your chare is ready to be load balanced; load balancing may not start right away
• ResumeFromSync() called when load balancing for this chare has finished

Default: Load balancer is periodic
• Provide period as a runtime parameter (+LBPeriod)

46

Thank You!

Free source, binaries, manuals, and more information at: http://charm.cs.uiuc.edu/

Parallel Programming Lab at University of Illinois
