basic charm++ and load balancing
1
Basic Charm++ and Load Balancing
Gengbin Zheng
charm.cs.uiuc.edu
10/11/2005
2
Charm++ Basics
3
Charm++
Parallel library for Object-Oriented C++ applications
Invoke functions remotely
• Messaging via remote method calls (like CORBA)
• Communication “proxy” objects
Methods called by scheduler
• System determines who runs next
Multiple objects per processor
Object migration fully supported
• Even with broadcasts, reductions
4
Virtualized Programming Model
User View
System implementation
User writes code in terms of communicating objects
System maps objects to processors
5
Chares – Concurrent Objects
Can be dynamically created on any available processor
Can be accessed from remote processors
Send messages to each other asynchronously
Contain “entry methods”
6
Charm++ Features: Object Arrays
[Figure: user’s view of the object array A[0], A[1], A[2], A[3], ..., A[n]]
Applications are written as a set of communicating objects
7
Charm++ Features: Object Arrays
Charm++ maps those objects onto processors, routing messages as needed
[Figure: the user’s view shows the array A[0], A[1], A[2], A[3], ..., A[n]; the system view shows elements such as A[3] and A[0] placed on processors]
8
Charm++ Features: Object Arrays
Charm++ can re-map (migrate) objects for communication, load balance, fault tolerance, etc.
[Figure: the user’s view shows the array A[0], A[1], A[2], A[3], ..., A[n]; the system view shows elements such as A[3] and A[0] placed on processors]
9
Charm++ Array Definition
Interface (.ci) file:
    array [1D] foo {
      entry foo(int problemNo);
      entry void bar(int x);
    };
In a .C file:
    class foo : public CBase_foo {
    public:
      // Remote calls
      foo(int problemNo) { ... }
      void bar(int x) { ... }
      // Migration support:
      foo(CkMigrateMessage *m) {}
      void pup(PUP::er &p) { ... }
    };
10
Charm++ Remote Method Calls
To call a method on a remote C++ object foo, use the local “proxy” C++ object CProxy_foo generated from the interface file:
Interface (.ci) file:
    array [1D] foo {
      entry foo(int problemNo);
      entry void bar(int x);
    };
In a .C file:
    CProxy_foo someFoo = ...;   // CProxy_foo: generated class
    someFoo[i].bar(17);         // i’th object, method and parameters
This results in a network message, and eventually in a call to the real object’s method:
In another .C file:
    void foo::bar(int x) {
      ...
    }
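The proxy idea on this slide can be imitated in plain C++ without the Charm++ runtime. The sketch below is only an illustration under assumptions: `Scheduler`, `FooArrayProxy` and `FooElementProxy` are hypothetical names, not the classes Charm++ generates, and everything lives in one process. It shows the key behaviour: calling `bar` through the proxy only enqueues a message, and the target object's method runs later when the scheduler delivers it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// The "real" object whose method is invoked remotely.
struct Foo {
  int last = 0;
  void bar(int x) { last = x; }
};

// Hypothetical stand-in for the runtime scheduler: a FIFO of pending calls.
class Scheduler {
  std::queue<std::function<void()>> pending_;
public:
  void enqueue(std::function<void()> msg) { pending_.push(std::move(msg)); }
  // Delivers every queued message; returns how many were delivered.
  int run() {
    int delivered = 0;
    while (!pending_.empty()) {
      std::function<void()> msg = std::move(pending_.front());
      pending_.pop();
      msg();
      ++delivered;
    }
    return delivered;
  }
};

// Proxy for a single element: bar() enqueues a message instead of calling.
class FooElementProxy {
  Scheduler& sched_;
  Foo* target_;
public:
  FooElementProxy(Scheduler& s, Foo* t) : sched_(s), target_(t) {}
  void bar(int x) {
    Foo* t = target_;
    sched_.enqueue([t, x] { t->bar(x); });
  }
};

// Proxy for the whole array: indexing selects the i'th element.
class FooArrayProxy {
  Scheduler& sched_;
  std::vector<Foo>& elems_;
public:
  FooArrayProxy(Scheduler& s, std::vector<Foo>& e) : sched_(s), elems_(e) {}
  FooElementProxy operator[](std::size_t i) {
    assert(i < elems_.size());
    return FooElementProxy(sched_, &elems_[i]);
  }
};
```

With a `FooArrayProxy proxy(sched, elems)`, the call `proxy[2].bar(17)` leaves `elems[2]` untouched until `sched.run()` is invoked, which mirrors the asynchronous call described above.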
11
Charm++ Startup Process: Main
Interface (.ci) file:
    module myModule {
      array [1D] foo {
        entry foo(int problemNo);
        entry void bar(int x);
      };
      mainchare myMain {
        entry myMain(int argc, char **argv);
      };
    };
In a .C file:
    #include "myModule.decl.h"
    class myMain : public CBase_myMain {
    public:
      myMain(int argc, char **argv) {
        int nElements = 7, i = nElements / 2;
        CProxy_foo f = CProxy_foo::ckNew(2, nElements);
        f[i].bar(3);
      }
    };
    #include "myModule.def.h"
Notes: myMain is the special startup object; its constructor is called at startup on PE 0; the CBase_ and CProxy_ classes are generated.
12
“Hello World!”
.ci file:
    mainmodule hello {
      mainchare mymain {
        entry mymain(CkArgMsg *m);
      };
    };
Generates hello.decl.h and hello.def.h
.C file:
    #include "hello.decl.h"
    class mymain : public CBase_mymain {
    public:
      mymain(CkArgMsg *m) {
        ckout << "Hello World" << endl;
        CkExit();
      }
    };
    #include "hello.def.h"
13
Compile and run the program
Compiling
• charmc <options> <source file>
• -o, -g, -language, -module, -tracemode
    pgm: pgm.ci pgm.h pgm.C
        charmc pgm.ci
        charmc pgm.C
        charmc -o pgm pgm.o -language charm++
To run a Charm++ program named "pgm" on four processors, type:
    charmrun pgm +p4 <params>
Nodelist file (for network architecture)
• list of machines to run the program
• host <hostname> <qualifiers>
Example Nodelist File:
    group main ++shell ssh
    host Host1
    host Host2
14
Charm++: Portability
Runs on:
Any machine with MPI, including
• IBM SP, Blue Gene/L
• Cray XT3
• Origin2000
PSC’s Lemieux (Quadrics Elan)
Clusters with Ethernet (UDP/TCP)
Clusters with Myrinet (GM)
Clusters with Ammasso cards
Apple clusters
Even Windows!
SMP-Aware (pthreads)
15
Build Charm++
Download from website
• http://charm.cs.uiuc.edu/download.html
Build Charm++
• ./build <target> <version> <options> [compile flags]
• ./build charm++ net-linux gm -g
• Parallel make (-j2)
Compile code using charmc
• Portable compiler wrapper
• Link with “-language charm++”
Run code using charmrun
16
How Does Charmrun Work?
[Figure: Charmrun launched as "charmrun +p4 ./pgm"; arrows labeled ssh, connect, Acknowledge]
17
Charmrun (batch mode)
[Figure: Charmrun launched as "charmrun ++batch 8"; arrows labeled ssh, connect, Acknowledge]
18
Debugging Charm++ Applications
printf
gdb
Sequentially (standalone mode)
• gdb ./pgm +vp16
Run debugger in xterm
• charmrun +p4 pgm ++debug
• charmrun +p4 pgm ++debug-no-pause
Memory paranoid
Parallel debugger
19
Charm++ Features
20
Message Driven Execution
Scheduler Scheduler
Message Q Message Q
Virtualization leads to Message Driven Execution
21
Prioritized Messages
Number of priority bits passed during message allocation
    FooMsg *msg = new (size, nbits) FooMsg;
Priorities stored at the end of messages
Signed integer priorities:
    *CkPriorityPtr(msg) = -1;
    CkSetQueueing(msg, CK_QUEUEING_IFIFO);
Unsigned bitvector priorities:
    CkPriorityPtr(msg)[0] = 0x7fffffff;
    CkSetQueueing(msg, CK_QUEUEING_BFIFO);
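To make the effect of integer priorities with FIFO tie-breaking concrete, here is a small plain C++ sketch. It is not Charm++ code: `Msg`, `Later` and `PrioQueue` are hypothetical names, and it assumes that a numerically smaller priority is delivered first and that equal priorities are delivered in arrival order.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Hypothetical message record (not a Charm++ type).
struct Msg {
  int priority;        // assumed convention: smaller value = more urgent
  unsigned long seq;   // arrival order, used to break ties FIFO
  std::string label;
};

// Comparator for std::priority_queue: returns true when a should be
// delivered after b, so the smallest (priority, seq) pair is on top.
struct Later {
  bool operator()(const Msg& a, const Msg& b) const {
    if (a.priority != b.priority) return a.priority > b.priority;
    return a.seq > b.seq;
  }
};

class PrioQueue {
  std::priority_queue<Msg, std::vector<Msg>, Later> heap_;
  unsigned long next_ = 0;
public:
  void send(int priority, const std::string& label) {
    heap_.push(Msg{priority, next_++, label});
  }
  bool empty() const { return heap_.empty(); }
  // Removes and returns the label of the most urgent pending message.
  std::string deliver() {
    assert(!heap_.empty());
    std::string label = heap_.top().label;
    heap_.pop();
    return label;
  }
};
```

Sending "a" at priority 0, "urgent" at priority -1 and "b" at priority 0 yields the delivery order "urgent", "a", "b".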
22
Advanced Message Features
Expedited messages
• Messages do not go through the Charm++ scheduler (faster)
• Top priority messages
Immediate messages
• Entries are executed in an interrupt or the communication thread
• Very fast, but tough to get right
23
Object Migration
24
How to Migrate a Virtual Processor?
Move all application state to new processor
Stack Data (threads)
• Subroutine variables and calls
• Managed by compiler
Heap Data
• Allocated with malloc/free
• Managed by user
Global Variables
Open files, environment variables, etc. (not handled yet!)
25
Migration Solutions
Stack Data (threads)
• Automatic: isomalloc stacks
Heap Data
• Use “-memory isomalloc” -or-
• Write pup routines
Global Variables
• Use “-swapglobals”: works on ELF platforms (Linux and Sun); just a pointer swap, no data copying
• -or- Remove globals entirely
26
Migrate Heap Data: PUP
Packing/unpacking user allocated data
Basic contract: here is my data
• Sizing: counts up data size
• Packing: copies data into message
• Unpacking: copies data back out
Same call works for network, memory, disk I/O ...
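The "one routine, three modes" contract can be sketched in plain C++ as follows. This is a toy stand-in, not the real `PUP::er`: the `Pupper` class, its `bytes` method, and the `pack`/`unpack` helpers are assumptions made for illustration. The same `Mesh::pup` routine is run once to size the buffer, once to pack, and once to unpack.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy stand-in for a PUP::er; the real Charm++ class differs.
class Pupper {
public:
  enum Mode { Sizing, Packing, Unpacking };
  explicit Pupper(Mode m, char* buf = nullptr) : mode_(m), buf_(buf) {}
  bool isUnpacking() const { return mode_ == Unpacking; }
  std::size_t size() const { return off_; }
  // Sizing only counts; Packing copies in; Unpacking copies out.
  void bytes(void* data, std::size_t n) {
    assert(mode_ == Sizing || buf_ != nullptr);
    if (n == 0) return;
    if (mode_ == Packing) std::memcpy(buf_ + off_, data, n);
    else if (mode_ == Unpacking) std::memcpy(data, buf_ + off_, n);
    off_ += n;
  }
private:
  Mode mode_;
  char* buf_;
  std::size_t off_ = 0;
};

struct Mesh {
  std::vector<float> nodes;
  // Single routine used for sizing, packing and unpacking.
  void pup(Pupper& p) {
    std::size_t n = nodes.size();
    p.bytes(&n, sizeof n);
    if (p.isUnpacking()) nodes.resize(n);  // allocate before copying out
    p.bytes(nodes.data(), n * sizeof(float));
  }
};

// Sizing pass followed by a packing pass.
std::vector<char> pack(Mesh& m) {
  Pupper sizer(Pupper::Sizing);
  m.pup(sizer);
  std::vector<char> buf(sizer.size());
  Pupper packer(Pupper::Packing, buf.data());
  m.pup(packer);
  return buf;
}

// Unpacking pass into a fresh object.
Mesh unpack(std::vector<char>& buf) {
  Mesh m;
  Pupper unpacker(Pupper::Unpacking, buf.data());
  m.pup(unpacker);
  return m;
}
```

Because the object describes its data once, the buffer could equally be a network message or a checkpoint file, which is the point of the last bullet on the slide.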
27
Migrate Heap Data: PUP C++ Example
    #include "pup.h"
    #include "pup_stl.h"
    class myMesh {
      std::vector<float> nodes;
      std::vector<int> elts;
    public:
      ...
      void pup(PUP::er &p) {
        p|nodes;
        p|elts;
      }
    };
28
Migrate Heap Data: PUP F90 Example
    TYPE(myMesh)
      INTEGER :: nn,ne
      REAL*4, ALLOCATABLE(:) :: nodes
      INTEGER, ALLOCATABLE(:) :: elts
    END TYPE

    SUBROUTINE pupMesh(p,mesh)
      USE MODULE ...
      INTEGER :: p
      TYPE(myMesh) :: mesh
      fpup_int(p,mesh%nn)
      fpup_int(p,mesh%ne)
      IF (fpup_isUnpacking(p)) THEN
        ALLOCATE(mesh%nodes(mesh%nn))
        ALLOCATE(mesh%elts(mesh%ne))
      END IF
      fpup_floats(p,mesh%nodes,mesh%nn);
      fpup_ints(p,mesh%elts,mesh%ne);
      IF (fpup_isDeleting(p)) deleteMesh(mesh);
    END SUBROUTINE
29
Automatic Load Balancing
30
Motivation
Irregular or dynamic applications
• Initial static load balancing
• Application behaviors change dynamically
• Difficult to implement with good parallel efficiency
Versatile, automatic load balancers
• Application independent
• Little or no user effort is needed for load balancing
• Work for both Charm++ and Adaptive MPI
31
Using Dynamic Mapping to Processors
Migrate objects between processors
• Use that for dynamic (and static, initial) load balancing
Two major approaches
No predictability of load patterns
• Fully dynamic
• Early work on State Space Search, Branch&Bound, ..
With certain predictability
• Measurement-based load balancing strategy
• CSE, molecular dynamics simulation
32
Applications Lacking Predictability
Flow of tasks: the application generates a continuous flow of tasks
The goal of the load balancing strategies is to balance these tasks across the system for fast response time and better throughput
Tasks are assigned at creation time, with no migration afterwards
33
Seed Load Balancing
Neighborhood averaging with work-stealing when idle, using immediate messages
Load balancing among neighboring processors
• Load is represented by length of queue
Work-stealing at idle time with interruption-based message
• Fast response to the request
80000 objects, 10% heavy objects
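The work-stealing half of this scheme can be illustrated with a small plain C++ sketch. It is not the Charm++ seed balancer: the function name `stealIfIdle`, the ring topology, and the "take half of the longer neighboring queue" policy are all assumptions chosen for illustration. As on the slide, load is represented by queue length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// queues[i] is the number of pending seeds on processor i of a ring.
// If processor pe is idle (empty queue), it takes half (rounded down) of
// the longer of its two neighbors' queues; ties go to the left neighbor.
// Returns the number of seeds moved.
int stealIfIdle(std::vector<int>& queues, std::size_t pe) {
  assert(pe < queues.size());
  const std::size_t n = queues.size();
  if (n < 2 || queues[pe] > 0) return 0;  // nobody to steal from, or busy
  const std::size_t left = (pe + n - 1) % n;
  const std::size_t right = (pe + 1) % n;
  const std::size_t victim = queues[left] >= queues[right] ? left : right;
  const int moved = queues[victim] / 2;
  queues[victim] -= moved;
  queues[pe] += moved;
  return moved;
}
```

Starting from queue lengths {8, 0, 4, 0}, an idle processor 1 takes 4 seeds from processor 0, and an idle processor 3 then takes 2 seeds from processor 2; the total number of seeds is unchanged.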
34
Link with a seed load balancer
Use -balance <random|neighbor>
• charmc -o pgm pgm.o -balance neighbor
Specify topology
• +LBTopo <ring|torus2d|…>
35
Principle of Persistence
Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time
In spite of dynamic behavior
• Abrupt and large, but infrequent changes (e.g., AMR)
• Slow and small changes (e.g., particle migration)
Parallel analog of principle of locality
• A heuristic that holds for most CSE applications
• Run-time instrumentation is possible
36
Measurement Based Load Balancing
Runtime instrumentation
• Measures CPU load per object
• Measures communication volume between objects
Measurement based load balancers
• Use the instrumented database periodically to make new decisions
• A load balancing strategy takes the database as input and generates a new object-to-processor mapping
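As a rough illustration of what a strategy might first compute from such a database, the plain C++ sketch below sums measured per-object CPU time into per-processor loads and reports the ratio of maximum to average load. The record layout `ObjStat` and the functions `peLoads` and `imbalance` are hypothetical, not the Charm++ load balancing database API, and communication volume is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical database record: where an object lives and its measured CPU time.
struct ObjStat {
  std::size_t pe;
  double cpu;
};

// Sums measured object loads per processor.
std::vector<double> peLoads(const std::vector<ObjStat>& db, std::size_t numPes) {
  std::vector<double> load(numPes, 0.0);
  for (const ObjStat& o : db) {
    assert(o.pe < numPes);
    load[o.pe] += o.cpu;
  }
  return load;
}

// Ratio of maximum to average processor load; 1.0 means perfectly balanced.
double imbalance(const std::vector<double>& load) {
  assert(!load.empty());
  const double total = std::accumulate(load.begin(), load.end(), 0.0);
  if (total == 0.0) return 1.0;
  const double maxLoad = *std::max_element(load.begin(), load.end());
  return maxLoad / (total / static_cast<double>(load.size()));
}
```

A strategy could use such a ratio to decide whether producing a new object-to-processor mapping is worth the migration cost.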
37
Load Balancing – graph partitioning
[Figure: mapping of objects onto Charm++ PEs, and the LB view: the weighted object graph as seen by the load balancer]
38
Charm++ Load Balancer in Action
Automatic Load Balancing in Crack Propagation
39
Load Balancer Categories
Centralized
• Object load data are sent to processor 0
• Integrate to a complete object graph
• Migration decision is broadcast from processor 0
• Global barrier
Distributed
• Load balancing among neighboring processors
• Build partial object graph
• Migration decision is sent to its neighbors
• No global barrier
40
Main Centralized Load Balancing Strategies
GreedyCommLB
• A “greedy” load balancing strategy which uses the process load and communications graph to map the processes with the highest load onto the processors with the lowest load, while trying to keep communicating processes on the same processor
RefineLB
• Incremental adjustment by moving objects off overloaded processors to under-utilized processors to reach average load
MetisLB
• Uses the METIS graph partitioning library to partition the object-communication graph with node (object) weights and communication loads on edges
OrbLB
• Treats objects with spatial coordinates. It applies an orthogonal recursive bisection algorithm which attempts to provide a more balanced division of space
Others
• The manual discusses several other load balancers which are not used as often, but may be useful in some cases; also, more are being developed
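The greedy idea (heaviest object onto the currently least loaded processor) can be sketched in a few lines of plain C++. This is a simplified illustration, not the Charm++ implementation: the function name `greedyMap` is hypothetical, and unlike GreedyCommLB it ignores the communication graph entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Returns mapping[obj] = processor, assigning objects in order of
// decreasing load to whichever processor currently has the least load.
std::vector<std::size_t> greedyMap(const std::vector<double>& objLoad,
                                   std::size_t numPes) {
  assert(numPes > 0);
  // Object indices sorted by decreasing load (stable for equal loads).
  std::vector<std::size_t> order(objLoad.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return objLoad[a] > objLoad[b];
                   });
  // Min-heap of (current load, processor); ties go to the lower processor.
  using Entry = std::pair<double, std::size_t>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pes;
  for (std::size_t p = 0; p < numPes; ++p) pes.push(Entry(0.0, p));

  std::vector<std::size_t> mapping(objLoad.size());
  for (std::size_t obj : order) {
    Entry least = pes.top();
    pes.pop();
    mapping[obj] = least.second;
    least.first += objLoad[obj];
    pes.push(least);
  }
  return mapping;
}
```

For object loads {5, 3, 3, 2, 2, 1} on two processors, this assigns objects 0, 3 and 5 to processor 0 and objects 1, 2 and 4 to processor 1, giving each a total load of 8. A refinement strategy in the spirit of RefineLB would instead start from the existing mapping and move only a few objects.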
41
Load Balancing Strategies
[Figure: class hierarchy rooted at BaseLB, with CentralLB and NborBaseLB beneath it. Strategies shown: OrbLB, DummyLB, MetisLB, RecBisectBfLB, GreedyLB, RandCentLB, RefineLB, GreedyCommLB, RandRefLB, RefineCommLB, GreedyRefLB, NeighborLB]
42
Neighborhood Load Balancing Strategies
NeighborLB
• Each processor tries to average out its load only among its neighbors
WSLB
• A load balancer for timeshared workstation clusters, which can detect load changes on desktops and adjust load without interfering with others' use of the desktop
43
Compiler Interface
Link time options
-module: Link load balancers as modules
• -module EveryLB
Link multiple modules into binary
• -balancer GreedyCommLB -balancer RefineLB
• -balancer ComboCentLB:GreedyLB,RefineLB
44
Runtime Options
Run-time options do the same thing, but override the compile time options
+balancer: invoke a load balancer
Can have multiple load balancers
• +balancer GreedyCommLB +balancer RefineLB
45
When to Re-balance Load?
Programmer Control: ReadyLoadBalance()
Enable load balancing at specific point
• Object ready to migrate
• Re-balance if needed
• ReadyLoadBalance() called when your chare is ready to be load balanced; load balancing may not start right away
• ResumeFromSync() called when load balancing for this chare has finished
Default: Load balancer is periodic
• Provide period as a runtime parameter (+LBPeriod)
46
Thank You!
Free source, binaries, manuals, and more information at: http://charm.cs.uiuc.edu/
Parallel Programming Lab at University of Illinois