TRANSCRIPT
1
Developing Scalable Parallel Applications in X10
David Cunningham, David Grove, Vijay Saraswat, and Olivier Tardieu
IBM Research
Based on material from previous X10 tutorials by Bard Bloom, Evelyn Duesterwald, Steve Fink, Nate Nystrom, Igor Peshansky, Christoph von Praun, and Vivek Sarkar. Additional contributions from Anju Kambadur, Josh Milthorpe, and Juemin Zhang.
This material is based upon work supported in part by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.
Please see x10-lang.org for the most up-to-date version of these slides and sample programs.
2
Tutorial Objectives
Learn to develop scalable parallel applications using X10
– X10 language and class library fundamentals
– Effective patterns of concurrency and distribution
– Case studies: draw material from real working code
– Lab sessions: learn by doing
Exposure to the Asynchronous PGAS programming model
– Concepts broader than any single language
Updates on the X10 project and community
– Highlight what is new in X10 since SC’10
– Gather feedback from you about X10
3
X10 Tutorial Outline
X10 Overview
X10 Fundamentals (~60 minutes)
X10 Workstealing (~20 minutes)
X10 on GPUs (~40 minutes)
X10 Tools (~20 minutes)
Lab Session I
LUNCH
Four Case Studies (approximately 30 minutes each)
LU Factorization
Global Load Balancing Framework
Global Matrix Library
Selections from ANUChem
Summary (~15 minutes)
Lab Session II (~45 minutes)
4
X10: Performance and Productivity at Scale
An evolution of Java for concurrency, scale out, and heterogeneity
The X10 language provides:
– Ability to specify fine-grained concurrency
– Ability to distribute computation across large-scale clusters
– Ability to represent heterogeneity at the language level
– Simple programming model for computation offload
– Modern OO language features (build libraries/frameworks)
– Support for writing deadlock-free and determinate code
– Interoperability with Java
– Interoperability with native libraries (e.g., BLAS)
5
X10 Target Environments
High-end large clusters
– BlueGene/P
– Power7IH
– x86 + IB, Power + IB
– Goal: deliver scalable performance competitive with C+MPI
Medium-scale commodity systems
– ~100 nodes (~1000 cores and ~1 terabyte main memory)
– Scale-out environments, but MTBF is days, not minutes
– Programs that run in minutes/hours at this scale
– Goal: deliver main-memory performance with a simple programming model (accessible to Java programmers)
Developer laptops
– Linux, Mac, Windows. Eclipse-based IDE, debugger, etc.
– Goal: support developer productivity
6
X10 Programming Model
Asynchronous PGAS (Partitioned Global Address Space)
– Global address space, but with explicit locality
– Activities can be dynamically created and joined
Abstracts hardware details, while still allowing programmer to express locality and concurrency.
[Diagram: places as address-space partitions, each running parallel activities; legend shows threads/processes, address spaces, and memory references, with some references crossing place boundaries]
7
X10 APGAS Constructs
Fine grained concurrency
• async S
Atomicity
• atomic S
• when (c) S
Place-shifting operations
• at (P) S
Ordering
• finish S
• clocked, next
Two central ideas: Places and Asynchrony
Language Support for Replication of Values Across Places
Distributed object model
Replicate by default
GlobalRef
Global data structures
DistArray
8
Hello Whole World

import x10.io.Console;

class HelloWholeWorld {
  public static def main(Array[String]) {
    finish for (p in Place.places()) {
      async at (p)
        Console.OUT.println("Hello World from place " + p.id);
    }
  }
}
(%1) x10c++ -o HelloWholeWorld -O HelloWholeWorld.x10
(%2) runx10 -n 4 HelloWholeWorld
Hello World from place 0
Hello World from place 2
Hello World from place 3
Hello World from place 1
(%3)
9
Sequential MontyPi

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val r = new Random();
    var result:Double = 0;
    for (1..N) {
      val x = r.nextDouble();
      val y = r.nextDouble();
      if (x*x + y*y <= 1) result++;
    }
    val pi = 4*result/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}
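The estimator itself is language-independent: sample points in the unit square and count how many land inside the quarter circle. A rough Python sketch of the same sequential loop (function name and seeding are illustrative, not part of X10):

```python
import random

def monty_pi(n, seed=0):
    """Estimate pi by sampling n points in the unit square and
    counting the fraction that satisfy x*x + y*y <= 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x*x + y*y <= 1:
            hits += 1
    return 4 * hits / n

print(monty_pi(100_000))  # roughly 3.14
```

With 100,000 samples the standard error of the estimate is about 0.005, so the result is reliably close to pi.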
10
Concurrent MontyPi

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val P = Int.parse(args(1));
    val result = new Cell[Double](0);
    finish for (1..P) async {
      val r = new Random();
      var myResult:Double = 0;
      for (1..(N/P)) {
        val x = r.nextDouble();
        val y = r.nextDouble();
        if (x*x + y*y <= 1) myResult++;
      }
      atomic result() += myResult;
    }
    val pi = 4*(result())/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}
11
Distributed MontyPi

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val result = GlobalRef[Cell[Double]](new Cell[Double](0));
    finish for (p in Place.places()) at (p) async {
      val r = new Random();
      var myResult:Double = 0;
      for (1..(N/Place.MAX_PLACES)) {
        val x = r.nextDouble();
        val y = r.nextDouble();
        if (x*x + y*y <= 1) myResult++;
      }
      val ans = myResult;
      at (result) atomic result()() += ans;
    }
    val pi = 4*(result()())/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}
12
X10 Project
X10 is an open source project (http://x10-lang.org)
– Released under Eclipse Public License (EPL) v1– Current Release: 2.2.1 (September 2011)
X10 Implementations– C++ based (“Native X10”)
• Multi-process (one place per process; multi-node)• Linux, AIX, MacOS, Cygwin, BlueGene/P• x86, x86_64, PowerPC
– JVM based (“Managed X10”)• Multi-process (one place per JVM process; multi-node)
– Limitation on Windows to single process (single place)• Runs on any Java 5/Java 6 JVM
X10DT (X10 IDE) available for Windows, Linux, Mac OS X
– Based on Eclipse 3.6
– Supports many core development tasks including remote build/execute facilities
13
X10 Compilation Flow
[Diagram: X10 compilation flow. X10 source is parsed and type-checked by the compiler front-end into an X10 AST, which undergoes AST optimizations and lowering. The C++ back-end generates C++ source, compiled by a C++ compiler (with the XRC/XRX runtime) to native code over X10RT ("Native X10"). The Java back-end generates Java source, compiled to bytecode (with the XRJ runtime) for Java VMs ("Managed X10"), bridging to the native environment via JNI.]
14
Overview of Language Features
Many sequential features of Java inherited unchanged
 Classes (w/ single inheritance)
 Interfaces (w/ multiple inheritance)
 Instance and static fields
 Constructors, (static) initializers
 Overloaded, overridable methods
 Garbage collection
 Familiar control structures, etc.
Structs
Closures
Multi-dimensional & distributed arrays
Substantial extensions to the type system
 Dependent types
Generic types
Function types
Type definitions, inference
Concurrency
 Fine-grained concurrency:
async S;
Atomicity and synchronization
atomic S;
when (c) S;
Ordering
finish S;
Distribution
GlobalRef[T]
at (p) S;
15
Classes
 Single inheritance, multiple interfaces
 Static and instance methods
 Static val fields
 Instance fields may be val or var
 Values of class types may be null
 Heap allocated
16
Structs
User-defined primitives
 No inheritance
 May implement interfaces
 All fields are val
 All methods are final
 Allocated "inline" in containing object, array, or variable
 Headerless
struct Complex {
  val real:Double;
  val img:Double;
  def this(r:Double, i:Double) {
    real = r; img = i;
  }
  def operator + (that:Complex) {
    return Complex(real + that.real, img + that.img);
  }
  ....
}

val x = new Array[Complex](1..10)
17
Function Types
(T1, T2, ..., Tn) => U
type of functions that take arguments Ti and returns U
If f: (T) => U and x: T
then invoke with f(x): U
Function types can be used as an interface
 Define the apply operator with the appropriate signature:
 operator this(x:T): U
Closures
First-class functions
(x: T): U => e
used in array initializers:
new Array[int](5, (i:int) => i*i )
yields the array [ 0, 1, 4, 9, 16 ]
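The closure-based array initializer maps directly onto higher-order functions in other languages; a Python sketch of the same pattern (the helper name is made up for illustration):

```python
def make_array(n, init):
    """Build an n-element array by applying the closure `init` to each
    index, mirroring X10's new Array[Int](n, init) initializer form."""
    return [init(i) for i in range(n)]

print(make_array(5, lambda i: i * i))  # [0, 1, 4, 9, 16]
```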
18
Generic Types

Classes, structs, and interfaces may have type parameters
 class ArrayList[T]
 Defines a type constructor ArrayList and a family of types ArrayList[Int], ArrayList[String], ArrayList[C], ...
 ArrayList[C]: as if the ArrayList class were copied and C substituted for T
 Can instantiate on any type, including primitives (e.g., Int)
public final class ArrayList[T] {
  private var data:Array[T];
  private var ptr:int = 0;

  public def this(n:int) {
    data = new Array[T](n);
  }
  ….
  public def addLast(elem:T) {
    if (ptr == data.size) { …grow data… }
    data(ptr++) = elem;
  }

  public def get(idx:int) {
    if (idx < 0 || idx >= ptr) { throw … }
    return data(idx);
  }
}
19
Dependent Types

Classes and structs may have properties (public val instance fields)
 class Region(rank:Int, zeroBased:Boolean, rect:Boolean) { ... }
Can constrain properties with a boolean expression
 Region{rank==3}: type of all regions with rank 3
 Array[Int]{region==R}: type of all arrays defined over region R
 R must be a constant or a final variable in scope at the type
Dependent types are checked statically.
Dependent types used to statically check locality properties
Dependent types used heavily in Array library to check invariants
Dependent type system is extensible
Dependent types enable optimization (compile-time constant propagation of properties)
20
Type inference

Field and local variable types inferred from initializer type
 val x = 1; — x has type int{self==1}
 val y = 1..2*1..10; — y has type Region{rank==2,rect}
Method return types inferred from method body
 def m() { ... return true ... return false ... } — m has return type Boolean
Loop index types inferred from region
 R: Region{rank==2}
 for (p in R) { ... } — p has type Point{rank==2}
Programming Tip: It is often important not to declare the types of local vals; this enables type inference to compute the most precise possible type.
21
async
async S
Creates a new child activity that executes statement S
Returns immediately
S may reference variables in enclosing blocks
Activities cannot be named
Activity cannot be aborted or cancelled
Stmt ::= async Stmt
cf Cilk’s spawn
// Compute the Fibonacci
// sequence in parallel.
def fib(n:Int):Int {
  if (n < 2) return 1;
  val f1:Int;
  val f2:Int;
  finish {
    async f1 = fib(n-1);
    f2 = fib(n-2);
  }
  return f1+f2;
}
22
// Compute the Fibonacci
// sequence in parallel.
def fib(n:Int):Int {
  if (n < 2) return 1;
  val f1:Int;
  val f2:Int;
  finish {
    async f1 = fib(n-1);
    f2 = fib(n-2);
  }
  return f1+f2;
}
finish
finish S
Execute S, but wait until all (transitively) spawned asyncs have terminated.
Rooted exception model
Trap all exceptions thrown by spawned activities.
Throw an (aggregate) exception if any spawned async terminates abruptly.
Implicit finish enclosing the main activity
finish is useful for expressing “synchronous” operations on (local or) remote data.
Stmt ::= finish Stmt
cf Cilk’s sync
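The fib example's structure — spawn one branch with async, compute the other inline, and join at the finish — can be mimicked coarsely with one OS thread per async. A Python sketch of that pattern (not how the X10 runtime schedules activities, which multiplexes asyncs over a worker pool):

```python
import threading

def fib(n):
    """X10-style fib: 'async' becomes a spawned thread for fib(n-1),
    the parent computes fib(n-2), and 'finish' is the join."""
    if n < 2:
        return 1
    result = [0]  # mutable cell standing in for X10's val f1
    t = threading.Thread(target=lambda: result.__setitem__(0, fib(n - 1)))
    t.start()        # async f1 = fib(n-1);
    f2 = fib(n - 2)  # runs in the parent activity
    t.join()         # finish: wait for the spawned activity
    return result[0] + f2

print(fib(10))  # 89 with this base case (fib(0) = fib(1) = 1)
```

One thread per task is wasteful for real workloads; it is used here only to make the finish/async pairing concrete.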
23
// push data onto concurrent
// list-stack
val node = new Node(data);
atomic {
  node.next = head;
  head = node;
}
atomic

atomic S
Execute statement S atomically
Atomic blocks are conceptually executed in a serialized order with respect to all other atomic blocks: isolation and weak atomicity.
An atomic block body (S) ...
 must be nonblocking
 must not create concurrent activities (sequential)
 must not access remote data (local)
Restrictions checked dynamically
// target defined in lexically
// enclosing scope.
atomic def CAS(old:Object, n:Object) {
  if (target.equals(old)) {
    target = n;
    return true;
  }
  return false;
}
Stmt ::= atomic Statement
MethodModifier ::= atomic
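The serialized-in-some-order guarantee of atomic blocks — as used by `atomic result() += myResult` in the concurrent MontyPi — can be modeled with a lock protecting a shared cell. A Python sketch (the Cell class is made up for illustration; X10's atomic is a language construct, not a lock API):

```python
import threading

class Cell:
    """Mutable cell updated under a lock, standing in for X10's
    atomic blocks: updates are serialized with respect to each other."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def add(self, delta):
        with self._lock:  # atomic { ... }
            self._value += delta

    def get(self):
        with self._lock:
            return self._value

result = Cell()
threads = [threading.Thread(target=lambda: [result.add(1) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(result.get())  # 4000: no increments lost
```

Without the lock, concurrent `+=` on a plain attribute could lose updates; the lock plays the role of the runtime's atomicity guarantee.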
24
when
when (E) S
Activity suspends until a state in which the guard E is true.
In that state, S is executed atomically and in isolation.
Guard E is a boolean expression
 must be nonblocking
 must not create concurrent activities (sequential)
 must not access remote data (local)
 must not have side-effects (const)
Stmt ::= WhenStmt
WhenStmt ::= when ( Expr ) Stmt | WhenStmt or (Expr) Stmt
class OneBuffer {
  var datum:Object = null;
  var filled:Boolean = false;
  def send(v:Object) {
    when ( !filled ) {
      datum = v;
      filled = true;
    }
  }
  def receive():Object {
    when ( filled ) {
      val v = datum;
      datum = null;
      filled = false;
      return v;
    }
  }
}
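OneBuffer's when-guards map naturally onto condition variables: suspend until the guard holds, then update atomically. A Python sketch of the same one-slot buffer (an analogy, not generated code):

```python
import threading

class OneBuffer:
    """One-slot buffer: send blocks while filled, receive blocks while
    empty, mirroring the when(!filled) / when(filled) guards above."""
    def __init__(self):
        self.datum = None
        self.filled = False
        self.cond = threading.Condition()

    def send(self, v):
        with self.cond:
            while self.filled:        # when (!filled)
                self.cond.wait()
            self.datum, self.filled = v, True
            self.cond.notify_all()

    def receive(self):
        with self.cond:
            while not self.filled:    # when (filled)
                self.cond.wait()
            v, self.datum, self.filled = self.datum, None, False
            self.cond.notify_all()
            return v

buf = OneBuffer()
out = []
t = threading.Thread(target=lambda: out.append(buf.receive()))
t.start()
buf.send("hello")
t.join()
print(out)  # ['hello']
```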
25
at
at (p) S
Execute statement S at place p
Current activity is blocked until S completes
Deep copy of local object graph to target place; the variables referenced from S define the roots of the graph to be copied.
GlobalRef[T] used to create remote pointers across at boundaries
Stmt ::= at (p) Stmt
class C {
  var x:int;
  def this(n:int) { x = n; }
}

// Increment remote counter
def inc(c:GlobalRef[C]) {
  at (c.home) c().x++;
}

// Create GR of C
static def make(init:int) {
  val c = new C(init);
  return GlobalRef[C](c);
}
26
Distributed Object Model
Objects live in a single place
Objects can only be accessed in the place where they live
Object graph reachable from the statements in an “at” will be deeply copied from the source to destination place and a new isomorphic graph created
Instance fields of an object may be declared transient to break deep copy
GlobalRef[T] enables cross-place pointers to be created
GlobalRef can be freely copied, but referent can only be accessed via the apply at its home place
public struct GlobalRef[T](home:Place){T<:Object} {
public native def this(t:T):GlobalRef{self.home==here};
public native operator this(){here == this.home}:T;
}
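The home-place constraint on GlobalRef's apply operator — dereference only where `here == this.home` — can be sketched as a reference that records its home and refuses to dereference elsewhere. A Python analogy (class and method names invented; X10 enforces this statically via dependent types, not a runtime check):

```python
class GlobalRef:
    """Reference that records its home 'place'; dereferencing at any
    other place raises, mimicking the here == this.home constraint."""
    def __init__(self, obj, home):
        self.obj = obj    # the referent stays at its home place
        self.home = home

    def deref(self, here):
        if here != self.home:
            raise RuntimeError("GlobalRef dereferenced away from home")
        return self.obj

ref = GlobalRef([1, 2, 3], home=0)
print(ref.deref(here=0))  # [1, 2, 3]
```

The ref itself can be copied anywhere (like the struct above), but only code running at the home place may follow it.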
27
Summary
Whirlwind tour through X10 languageAdditional details will be introduced via case studies
See http://x10-lang.org for specification, samples, etc.
Next topic: alternate implementations of async/finish in the X10 runtime and their implications for programmers
Questions so far?
28
X10 Runtime
One process per place
• choice of inter-process communication libraries (x10rt)
• standalone (shared mem), sockets (tcp/ip), MPI, PAMI
Pool of worker threads per place
• runtime scheduler maps asyncs to threads (many to 1)
• choice of two schedulers• fork-join scheduler (default, ~ Java Fork-Join)• work-stealing scheduler (work in progress, ~ Cilk)
• to use compile with “x10c++ -WORK-STEALING”
29
X10 Schedulers
• each worker maintains a queue of pending jobs• async p; q…
• FJ: worker pushes p, runs q… (job = async)
• WS: worker pushes q…, runs p (job = continuation)
• idle worker
 • pops a job from its own queue if any
 • steals a job from a random other worker if empty (spins)
• queues are double-ended
 • push and pop from one end, steal from the other end
 => no contention for statically balanced workloads
 • less contention if queues are non-empty
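The double-ended discipline can be sketched with a plain deque: the owner pushes and pops at one end, a thief steals from the other, so they only meet when the queue is nearly empty. A single-threaded Python illustration of the access pattern (real work-stealing deques are concurrent and use fences/CAS as noted below):

```python
from collections import deque

class WorkQueue:
    """Owner pushes/pops at the right end; thieves steal from the left,
    so owner and thief only collide when the queue is nearly empty."""
    def __init__(self):
        self.jobs = deque()

    def push(self, job):   # owner: an async was spawned
        self.jobs.append(job)

    def pop(self):         # owner: take the most recent job
        return self.jobs.pop() if self.jobs else None

    def steal(self):       # thief: take the oldest job
        return self.jobs.popleft() if self.jobs else None

q = WorkQueue()
for job in ["a", "b", "c"]:
    q.push(job)
print(q.pop())    # 'c' (owner sees LIFO order)
print(q.steal())  # 'a' (thief takes from the opposite end)
```

Owner LIFO keeps recently created (cache-warm) work local; thieves taking the oldest job tend to get the largest remaining subtrees.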
30
Blocking Constructs
• finish typically does not block the worker
 • WS: run all subtasks first
 • FJ: run a subtask from the queue instead of blocking
• finish may block because of thieves or remote asyncs
 • worker 1 reaches the end of a finish block while worker 2 is still running an async
• context switch (blocking finish, when)
 • FJ: OS-level context switch
  • worker thread suspends, another worker thread is awakened
 • WS: runtime-level context switch
  • runtime saves the worker's state, assigns a new job to the worker
31
Cost Analysis

• async
 • queue operations (fences, at worst CAS)
• finish (single place)
 • synchronized counter (FJ: always, WS: if >1 worker)
• FJ relies on helper objects for finish and async
 • overhead of object creation + heap allocation + GC
• WS relies on compiler transformation (CPS)
 • sequential overhead (ex: 1.3x for QuickSort)
 • steal overhead (~ context switch)
32
Pros and Cons
• WS
 + much more scalable with fine-grain concurrency
 + fixed-size thread pool
 - sequential overhead
• FJ
 - requires coarser tasks
 - makes load balancing harder
 + more “obvious” performance model
 + more mature implementation (for now)
33
X10 Tutorial Outline
X10 Overview
X10 on GPUs
CUDA via APGAS
K-Means example & demo
X10 Tools
Lab Session I
LUNCH
Case Studies
Summary
Lab Session II
34
Why Program GPUs with X10?
Why program the GPU at all?
Many times faster for certain classes of applications
Hardware is cheap and widely available (as opposed to e.g. FPGA)
Why write X10 (instead of CUDA/OpenCL)?
Use the same programming model (APGAS) on GPU and CPU
Easier to write parallel/distributed GPU-aware programs
Better type safety
Higher level abstractions => fewer lines of code
35
GPU programming will never be as easy as CPU programming.
We try to make it as easy as possible.
Correctness
Can debug X10 kernel code on CPU first, using standard techniques
Static errors avoid certain classes of faults.
Goal: Eliminate all GPU segfaults with negligible overhead.
Currently detect all dereferences of non-local data (place errors)
TODO: Static array bounds checking
TODO: Static null-pointer checking
Performance: Need understanding of the CUDA performance model
Must know advantages+limitations of [registers, SHM, global memory]
Avoid warp divergence
Avoid irregular/misaligned memory access
Use CUDA profiling tool to debug kernel performance (very easy and usable)
Can inspect and disassemble generated cubin file
Easier to tune blocks/threads using auto-configuration
X10/GPU Programming Experience
36
Why target CUDA and not OpenCL?
In early 2009, OpenCL support was limited
CUDA is based on C++
OpenCL based on C
C++ features help us implement X10 features (e.g. generics)
By targeting CUDA we can re-use parts of existing C++ backend
CUDA evolving fast, more interesting for high level languages (e.g. OOP)
However there are advantages of OpenCL too...
We could support it in future
37
APGAS model ↔ GPU model
 at ↔ GPU / host divide
 finish ↔ kernel invocation
 async ↔ blocks, threads
 clocks ↔ __syncthreads()
 Java-like sequential constructs ↔ shared mem, constant mem, local mem, DMAs
38
GPU / Host divide

[Diagram: hosts connected by a high-performance network; each host drives one or more CUDA GPUs over the PCI bus. Each GPU is a 'place' in the APGAS model.]

Goals
 Re-use existing language concepts wherever possible
 Independent GPU memory space ⇒ new place
 Place count and topology unknown until run-time
 Same X10 code works on different run-time configurations
 Could use same approach for OpenCL, Cell, FPGA, …
39
X10/CUDA code (mass sqrt example)

for (host in Place.places()) at (host) {
  val init = new Array[Float](1000, (i:Int)=>i as Float);
  val recv = new Array[Float](1000);
  for (accel in here.children().values()) {  // discover GPUs
    // alloc on GPU
    val remote = CUDAUtilities.makeRemoteArray[Float](accel, 1000, (Int)=>0.0f);
    val num_blocks = 8, num_threads = 64;
    finish async at (accel) @CUDA {          // GPU code from here down
      finish for ([block] in 0..(num_blocks-1)) async {
        clocked finish for ([thread] in 0..(num_threads-1)) clocked async {
          val tid = block*num_threads + thread;
          val tids = num_blocks*num_threads;
          for (var i:Int=tid ; i<1000 ; i+=tids) {
            // 'init' is implicitly captured and transferred to the GPU
            remote(i) = Math.sqrtf(init(i));
          }
        }
      }
    }
    // Console.OUT.println(remote(42));      // static type error
    finish Array.asyncCopy(remote, 0, recv, 0, recv.size);  // copy result to host
    Console.OUT.println(recv(42));
  }
}
40
__global__ void kernel(...) {
  ...
}

kernel<<<num_blocks, num_threads>>>(...)
CUDA threads as APGAS activities
41
CUDA threads as APGAS activities
finish async at (p) @CUDA {
finish for ([block] in 0..(blocks-1)) async {
clocked finish for ([thread] in 0..(threads-1)) clocked async {
val tid = block*threads + thread; val tids = blocks*threads;
for (var i:Int=tid ; i<len ; i+=tids) { remote(i) = Math.sqrtf(init(i)); }
}
}
}
Enforces language restrictions
Define kernel shape
Only sequential code
Only primitive types / arrays
No allocation, no dynamic dispatch
[Diagram labels: CUDA kernel, CUDA block, CUDA thread]
42
Dynamic Shared Memory & barrier
finish async at (p) @CUDA {
finish for (block in 0..(blocks-1)) async {
val shm = new Array[Int](threads, (Int)=>0);
clocked finish for (thread in 0..(threads-1)) clocked async {
shm(thread) = f(thread); Clock.advanceAll(); val tmp = shm((thread+1) % threads);
} } }
Compiled to CUDA'shared' memory
Initialised on GPU(once per block)
General philosophy:
Use existing X10 constructs to express CUDA semantics
APGAS model is sufficient
CUDA barrier represented with clock op
43
Constant cache memory
val arr = new Array[Int](threads, (Int)=>0);
finish async at (p) @CUDA {
val cmem : Sequence[Int] = arr.sequence();
finish for ([block] in 0..(blocks-1)) async {
clocked finish for ([thread] in 0..(threads-1)) clocked async {
val tmp = cmem(42);
} } }
Compiled to CUDA'constant' cache
Automatically uploaded to GPU (just before kernel invocation)
('Sequence' is an immutable array)
General philosophy:
Quite similar to 'shared' memory
Again, APGAS model is sufficient
44
Blocks/Threads auto-config
finish async at (p) @CUDA {
val blocks = CUDAUtilities.autoBlocks(), threads = CUDAUtilities.autoThreads();
finish for ([block] in 0..(blocks-1)) async {
clocked finish for ([thread] in 0..(threads-1)) clocked async {
// kernel cannot assume any particular // values for 'blocks' and 'threads'
}
}
}
Maximise 'occupancy' at run-time according to a simple heuristic
If called on the CPU, autoBlocks() == 1, autoThreads() == 8
45
KMeansCUDA kernel

finish async at (gpu) @CUDA {
  val blocks = CUDAUtilities.autoBlocks(),
      threads = CUDAUtilities.autoThreads();
  finish for ([block] in 0..(blocks-1)) async {
    // cache clusters in SHM
    val clustercache = new Array[Float](num_clusters*dims, clusters_copy);
    clocked finish for ([thread] in 0..(threads-1)) clocked async {
      val tid = block * threads + thread;
      val tids = blocks * threads;
      // strip-mine the outer loop
      for (var p:Int=tid ; p<num_local_points ; p+=tids) {
        var closest:Int = -1;
        var closest_dist:Float = Float.MAX_VALUE;
        @Unroll(20) for ([k] in 0..(num_clusters-1)) {  // unroll this loop
          var dist:Float = 0;
          for ([d] in 0..(dims-1)) {
            // SOA form to allow regular accesses;
            // padded for alignment (stride includes padding)
            val tmp = gpu_points(p+d*num_local_points_stride)
                      - clustercache(k*dims+d);
            dist += tmp * tmp;
          }
          if (dist < closest_dist) {
            closest_dist = dist;
            closest = k;
          }
        }
        gpu_nearest(p) = closest;  // allocated earlier (via CUDAUtilities)
      }
    }
  }
}

Inputs are dragged from the host each iteration.
The above kernel is only part of the story: it is used once per iteration of Lloyd's algorithm, gpu_nearest is copied to the CPU when done, and the rest of the algorithm runs faster on the CPU!
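Stripped of the SOA layout, padding, and unrolling, the kernel's inner logic is plain nearest-centroid assignment: for each point, find the closest of K cluster centres by squared Euclidean distance. A sequential Python sketch of just that step (names are illustrative):

```python
def assign_nearest(points, clusters):
    """For each point, return the index of the closest cluster centre
    by squared Euclidean distance (the comparison done in the kernel)."""
    nearest = []
    for p in points:
        best, best_dist = -1, float("inf")
        for k, c in enumerate(clusters):
            dist = sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            if dist < best_dist:
                best, best_dist = k, dist
        nearest.append(best)
    return nearest

points = [(0.0, 0.0), (5.0, 5.0), (0.9, 1.1)]
clusters = [(0.0, 1.0), (5.0, 4.0)]
print(assign_nearest(points, clusters))  # [0, 1, 0]
```

Each point's assignment is independent, which is exactly why the loop strip-mines so cleanly across CUDA threads.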
46
K-Means graphical demo
Map of USA (can scroll / zoom)
300 Million points (2 dimensional space)
Data auto-generated from zip-code population data
 Town pop. normally distributed around town centers
Use K-Means to find 50 centroids for USA population
 Possible use case: delivery distribution warehouses
Visualization uses GL
 Runs on the GPU
Completely independent from CUDA
GPUs can be used for graphics too! :)
Can compare K-Means performance on CPU and GPU
[Chart: K-Means end-to-end performance. X-axis: 2M/4M points with K=100/400; Y-axis: 0 to 3; series: 1x1, 1x2, 2x1, 2x2 hosts x GPUs (per host).]
K-Means End-to-End Performance
Colours show scaling up to 4 Tesla GPUs
Y-axis gives Gflops normalized WRT the native CUDA implementation (single GPU, host)
2M/4M points, K=100/400, 4 dimensions
Higher K ⇒ more GPU work ⇒ better scaling
Analysis

Performance sensitive to many factors:
 CPU code: fragile g++ optimisations
 X10/CUDA DMA bandwidth (GPUs share PCI bus)
 Infiniband for host<->host communication (also shares PCI bus)
 nvcc register allocation fragile
Outstanding DMA issue (~40% bandwidth loss)
 CUDA requires X10 CPU objects allocated with cudaMalloc
 Hard to integrate with existing libraries / GC
 Currently staging DMAs through a pre-allocated buffer
 ...complaints by users on CUDA forums
Options for further improving performance:
 Use multicore for CPU parts
 Hide DMAs
Hosts x GPUs (per host)
48
Future Work

Support texture cache
 Explicit allocation: val t = Texture.make(...)
 at (gpu) { … t(x, y) … }
Fermi architecture presents obvious opportunities for supporting X10
 Indirect branches ⇒ object orientation and exceptions
 NVidia's goal is the same as ours – allow high-level GPU programming
OpenCL
 No conceptual hurdles
 Different code-gen and runtime work
 X10 programming experience should be very similar
Memory allocation on GPU
 Latest CUDA has a 'malloc' call we might use
GPU GC cycle between kernel invocations
 A research project in itself
Atomic operations
GL interoperability
49
X10 Tutorial Outline
X10 Overview
X10 on GPUs
X10 Tools
Overview
Demo of X10DT and X10 Debugger
Lab Session I
LUNCH
Case Studies
Summary
Lab Session II
50
X10DT: Integrated Development Tools for the X10 Programmer

A modern development environment for productive parallel programming

Editing: source navigation, syntax highlighting, parsing errors, folding, hyperlinking, outline & quick outline, hover help, content assist, type hierarchy, format, search, call graph, quick fixes
Building: local and remote launch
 - Java/C++ back-end support
 - Remote target configuration
Launching and Debugging: integrated parallel X10 debugging
Help: X10 programmer guide, X10DT usage guide
51
X10 Editor plus a set of common IDE services
 Source navigation
Syntax highlighting
Parsing errors
Folding
Hyperlinking
Outline & quick outline
Hover help
Content assist
More services in progress:
 Type hierarchy view
Format
Search
Call graph view
Quick fixes
Editing
52
Editing Services
Outline view
Source folding
Hover help
Syntax highlighting
53
Editing Services (cont.)
Content-Assist (Ctrl-space)
54
Editing Services (cont.)
Early parsing errors
55
Editing Services (cont.)
Hyperlinking (Ctrl + mouse click) opens up a new window
56
X10DT Builders
Support both X10 back-ends: Java and C++
Ability to switch back-ends from the “Configure” context menu in the Package Explorer view.
57
X10 Builders (2) – C++ back-end
Sockets, MPI, Parallel Environment, IBM LoadLeveler.
Local vs Remote (SSH) Communication Interfaces
58
X10 Builders (3) – C++ back-end
Ability to customize compilation and linking for C++ generated files
Different sanity checks occur: SSH connection, compiler version, paths existence, etc…
Global validation available.
59
X10 Launcher
Run As shortcuts for the current back-end selected
Dedicated X10 launch configuration
60
X10 Launcher (2)
Possibility to tune communication interface options at launch-time
Number of Procs
61
X10 Launcher (3)
An X10 run with the C++ back-end switches automatically to the PTP Parallel Runtime Perspective for better monitoring
62
Debugging (C++ backend)
Extends the Parallel Debugger in the Eclipse Parallel Tools Platform (PTP) with support for X10 source-level debugging
 Breakpoints, stepping, variable views, stack views
63
Debugging (C++ backend)
Breakpoints
Variable View Execution controls
(resume, terminate, step into, step over, step return)
64
Help Part 1: X10 Language Help
Various topics of X10 language help
Help -> Help Contents
65
Help: Part 2: X10DT Help
Various help topics for installing and using X10DT
Help -> Help Contents
66
Demo
67
Lab Session I
Objectives
– Install X10/X10DT on your laptop
– Local build/launch of HelloWholeWorld
– Optional remote build/launch of HelloWholeWorld
– Explore code in the provided tutorial workspace
After lunch
– X10 case studies
– Summary
– Lab Session II
68
X10 Tutorial Outline
X10 Overview
X10 on GPUs
X10 Tools
Lab Session I
LUNCH
Case Studies
Summary
Lab Session II
69
Case Study Outline
LU Factorization
 Common communication/concurrency patterns in X10
 Teams (collective operations)
Global Load Balancing Framework
 Usage of finish/async/at
 Design/usage of a generic framework
Global Matrix Library
 Building a scalable distributed data structure in X10
 API design for end-user productivity
 Internal communication/concurrency patterns
ANUChem
 Extending X10's Array library
 Exploitation of X10's constrained types for performance
 Use of the distributed "collecting finish" idiom for scalability
All code available in the X10DT workspace; feel free to follow along…
70
LU Factorization
• SPMD implementation
 • one top-level async per place
   finish ateach (p in Dist.makeUnique())
 • globally synchronous progress
 • collectives: barriers, broadcasts, reductions (row, column, WORLD)
 • point-to-point RDMAs for row swaps
 • uses BLAS for local computations
• HPL-like algorithm
 • 2-d block-cyclic data distribution
 • right-looking variant with row partial pivoting (no look-ahead)
 • recursive panel factorization with pivot search and column broadcast combined
71
Key Features
• Distributed Data Structures• PlaceLocalHandle
• RemoteArray
• Communications• Array.asyncCopy
• Teams
• Optimizations• Finish Pragmas
• C bindings
72
PlaceLocalHandle
PlaceLocalHandle: global handle to per-place objects

Example:

val dist = Dist.makeUnique();
val constructor = ()=>new Array[Double](N);
val buffers = PlaceLocalHandle.make(dist, constructor);

• allocates a buffer array in each place
• returns a global handle to the collection of buffers

In each place, buffers() evaluates to the local buffer
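Conceptually, the handle maps each place to an independently constructed object. A toy Python sketch of that shape (a dict keyed by place id; the helper name is invented, and this ignores the distribution that makes the real thing useful):

```python
def make_place_local(places, constructor):
    """One object per 'place' behind a single handle, sketching
    PlaceLocalHandle.make(dist, constructor): the constructor runs
    once per place, so the per-place objects are independent."""
    return {p: constructor() for p in places}

buffers = make_place_local(range(4), lambda: [0.0] * 8)
print(len(buffers), len(buffers[2]))  # 4 8
```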
73
RemoteArray and asyncCopy
RemoteArray: global reference to a fixed local array

Example:

val myBuffer = buffers();
val globalRefToMyBuffer = new RemoteArray(myBuffer);

Array.asyncCopy: asynchronous copy from a source array to a preallocated destination array (source or dest is local)

• uses RDMAs if available
• avoids local data copies
74
Example
def exchange(row1:Int, row2:Int, min:Int, max:Int, dest:Int) {
  val sourceBuffer = new RemoteArray(buffers());
  val size = matrix().getRow(row1, min, max, buffers()); // pack
  finish {
    at (Place(dest)) async { // go to dest
      finish {
        // copy from source to dest
        Array.asyncCopy[Double](sourceBuffer, 0, buffers(), 0, size);
      } // wait for copy 1
      matrix().swapRow(row2, min, max, buffers()); // swap
      // copy from dest to source
      Array.asyncCopy[Double](buffers(), 0, sourceBuffer, 0, size)
    }
  } // wait for copies 1 and 2
  matrix().setRow(row1, min, max, buffers()); // store
}
75
Teams
X10 Teams are like MPI communicators

• predefined Team.WORLD
• constructor
  val team01 = new Team([Place(0), Place(1)]);
• split method
  colRole = here.id / py;
  rowRole = here.id % py;
  col = Team.WORLD.split(here.id, rowRole, colRole);
• operations: barrier, broadcast, indexOfMax, allreduce, …
  col.indexOfMax(colRole, Math.abs(max), id);
  col.addReduce(colRole, max, Team.MAX);
76
C Bindings
@NativeCPPInclude("essl_natives.h") // include .h in code generated from X10
@NativeCPPCompilationUnit("essl_natives.cc") // link with .cc
class LU {
  @NativeCPPExtern native static
  def blockTriSolve(me:Array[Double], diag:Array[Double], B:Int):void;
  @NativeCPPExtern native static
  def blockBackSolve(me:Array[Double], diag:Array[Double], B:Int):void;
  @NativeCPPExtern native static
  def blockMulSub(me:Array[Double], left:Array[Double],
                  upper:Array[Double], B:Int):void;
}

• mapping by function names and argument positions
• supports primitive types and arrays of primitives
77
Finish Pragmas
Optimize specific finish/async patterns
@Pragma(Pragma.FINISH_HERE) finish {
  at (Place(dest)) async {
    @Pragma(Pragma.FINISH_ASYNC) finish {
      Array.asyncCopy[Double](sourceBuffer, 0, buffers(), 0, size);
    }
    matrix().swapRow(row2, min, max, buffers()); // swap
    Array.asyncCopy[Double](buffers(), 0, sourceBuffer, 0, size)
  }
}

@Pragma(Pragma.FINISH_ASYNC) // unique async
@Pragma(Pragma.FINISH_HERE) // last async to terminate is local
@Pragma(Pragma.FINISH_LOCAL) // no remote async
@Pragma(Pragma.FINISH_SPMD) // no nested remote async
78
Global Load Balancing
79
Work-stealing in distributed memory

Autonomous stealing not possible
 Need help from the NIC or the victim
Sparsely distributed work is expensive to find
 Network congestion, wasted CPU cycles
Non-trivial to detect termination
 Users need to implement their own
Well-studied, but hard to implement
One task stolen at a time
 Shallow trees necessitate repeated stealing
 Requires aggregation of steals
80
GLB
Main activity spawns an activity per place within a finish.
Augment work-stealing with work-distributing:
 Each activity processes work; when none is left, it steals at most w times.
 (Thief) If no work is found, it visits at most z lifelines in turn; each donor records that the thief needs work.
 (Thief) If still no work is found, it dies.
 (Donor) Pushes work to recorded thieves when it has some.
When all activities die, the main activity continues to completion.
Caveat: this scheme assumes that work is relocatable.
81
GLB: state transitions
[State diagram: computing, stealing, distributing, dead]
Can enter X10 to handle asynchronous requests
82
GLB: sample execution
Number of steal attempts = 2
Number of lifelines = 2
83
GLB: lifeline graphs
Structure determines
  How soon work is propagated
  How much overhead is incurred
Ring network
  P/2 hops on average
  Low overhead
Desirable properties
  Each vertex must have bounded out-degree
  Connected
  Low diameter
84
Lifeline graph: cyclic hyper-cubes
[Figure: lifeline graph with nodes labeled in base 3: 00, 01, 02, 10, 11, 12, 20, 21]
P = number of processors
Choose h and z such that P <= h^z
P=8, h=3, z=2
Numbers in base h
Each node has at most z lifelines
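The cyclic hypercube construction above can be sketched directly: write each place id in base h with z digits, and form each lifeline by incrementing one digit modulo h, skipping ids >= P when P < h^z. This Python sketch is illustrative (function name invented here), using the slide's P=8, h=3, z=2.

```python
def lifelines(node, P, h, z):
    """Outgoing lifelines of `node` in a cyclic hypercube over P places.

    The node id is written in base h with z digits; each lifeline
    increments one digit modulo h.  Ids >= P (holes when P < h**z)
    are skipped, so every node ends up with at most z lifelines.
    """
    digits = [(node // h ** k) % h for k in range(z)]
    out = []
    for k in range(z):
        d = digits[:]
        d[k] = (d[k] + 1) % h
        peer = sum(dk * h ** k2 for k2, dk in enumerate(d))
        if peer < P and peer != node:
            out.append(peer)
    return out
```

For P=8, h=3, z=2 (as on the slide): node 0 ("00" in base 3) links to 1 ("01") and 3 ("10"); node 7 ("21") has only one lifeline because incrementing its low digit would give "22" = 8, which is outside the 8 places.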
85
UTS: Results
86
UTS: Results
1 attempt to steal before resorting to lifeline
87
UTS: Results
22% more time stealing
83 attempts to steal before resorting to lifeline
88
UTS: Results
87% efficiency
ANL’s BG/P
89
UTS: Results
86% efficiency
Power7 on IB cluster
90
UTS: Results
92% efficiency
IBM’s BG/P
91
UTS: Results
89% efficiency
IBM’s Power7 with IB
92
GLB Summary
GLB combines stealing and distributing
Scales with high efficiency to 2048 processors
X10’s UTS was ~350 lines of actual code
Future work
  Analytical framework for lifeline graphs
  Implement adaptive stealing
  Scaling out finish to 100K processors
  Hierarchical UTS with multiple threads per node
93
GLB: framework
static final class Fib2 extends TaskFrame[UInt, UInt] {
  public def runTask(t:UInt, s:Stack[UInt]):void offers UInt {
    if (t < 20u) offer fib(t);
    else { s.push(t-1); s.push(t-2); }
  }
  public def runRootTask(t:UInt, s:Stack[UInt]!):void offers UInt {
    runTask(t, s);
  }
}
TaskFrame[T1, T2] is the user interface for lifeline-based GLB
  T1: type of the work packet
  T2: return type of the evaluation function
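The TaskFrame pattern above can be mimicked sequentially: a worker pops work packets from a stack, solves small ones directly, and splits large ones into sub-packets, with `offer` accumulating partial results. A Python sketch (the cutoff of 20 and the fib decomposition come from the Fib2 example; the sequential driver here stands in for the distributed GLB runtime):

```python
from functools import lru_cache

CUTOFF = 20

@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def run_task(t, stack, offers):
    """Analogue of TaskFrame.runTask: solve small packets, split big ones."""
    if t < CUTOFF:
        offers.append(fib(t))        # 'offer' the partial result
    else:
        stack.append(t - 1)
        stack.append(t - 2)

def run_root_task(root):
    """Analogue of runRootTask plus the worker loop and final reduction."""
    stack, offers = [root], []
    while stack:
        run_task(stack.pop(), stack, offers)
    return sum(offers)               # reduction over all offered values
```

Because fib(n) = fib(n-1) + fib(n-2), summing the offered leaf values reconstructs fib(root) no matter how the packets are scheduled, which is why the real version tolerates arbitrary stealing.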
94
Global Matrix Library
95
Global Matrix Library (x10.gml)
• Object-oriented library implementing double-precision, blocked, sparse and dense matrices, distributed across multiple places
• Each kind of matrix implemented in a different class, all implement Matrix operations.
• Operations include cell-wise operations (addition, multiplication, etc.), matrix multiplication (using the SUMMA algorithm), Euclidean distance, etc.
• Presents standard sequential interface (internal parallelism and distribution).
• Written in X10, with native callouts to BLAS
• Runs on Managed Backend (multiple JVMs connected with sockets)
• Runs on Native Backend (on sockets/Ethernet, or MPI over Ethernet or InfiniBand).
96
Global Matrix Library: PageRank
X10:
def step(G:…, P:…, GP:…, dupGP:…, UP:…, U:…, E:…) {
  for (1..iteration) {
    GP.mult(G,P)
      .scale(alpha)
      .copyInto(dupGP); // broadcast
    P.local()
      .mult(E, UP.mult(U,P.local()))
      .scale(1-alpha)
      .cellAdd(dupGP.local())
      .sync(); // broadcast
  }
}

DML:
while (i < max_iteration) {
  p = alpha*(G%*%p) + (1-alpha)*(e%*%u%*%p);
}
• Programmer determines distribution of the data.
  – G is block-distributed by row
  – P is replicated
• Programmer writes sequential, distribution-aware code.
• Perform usual optimizations – allocate big data structures outside the loop.
• Could equally well have been written in Java.
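For reference, the iteration the X10 and DML kernels both express, p = alpha*(G p) + (1-alpha)*e*(uᵀp), can be sketched in plain Python. This is illustrative only (small dense matrices in pure Python; GML runs the same recurrence distributed with BLAS callouts):

```python
def pagerank(G, alpha=0.85, iters=50):
    """Power iteration p = alpha*G*p + (1-alpha)*e*(u^T p).

    G is a column-stochastic link matrix given as a list of rows;
    e is the all-ones column and u the uniform vector, so the second
    term redistributes (1-alpha) of the probability mass uniformly.
    """
    n = len(G)
    p = [1.0 / n] * n
    for _ in range(iters):
        Gp = [sum(G[i][j] * p[j] for j in range(n)) for i in range(n)]
        up = sum(p) / n                        # u^T p with uniform u
        p = [alpha * Gp[i] + (1 - alpha) * up for i in range(n)]
    return p
```

Because G is column-stochastic and both terms preserve total mass, p remains a probability distribution across iterations.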
97
PageRank performance
[Chart: PageRank runtime (seconds, 0–5) vs. nonzeros in G (10–1000 million), for Managed-Socket, Native-Socket, and Native-MPI; 40 cores (8 places, 5 workers)]
98
Using GML: Gaussian Non-Negative Matrix Factorization
• Key kernel for topic modeling
• Involves factoring a large (D x W) matrix
  – D ~ 100M
  – W ~ 100K, but sparse (0.001)
• Iterative algorithm, involves distributed sparse matrix multiplication, cell-wise matrix operations.

X10:
for (1..iteration) {
  H.cellMult(WV
    .transMult(W,V,tW)
    .cellDiv(WWH
      .mult(WW.transMult(W,W),H)));
  W.cellMult(VH
    .multTrans(V,H)
    .cellDiv(WHH
      .mult(W,HH.multTrans(H,H))));
}
• Key decision is representation for matrix, and its distribution.
– Note: app code is polymorphic in this choice.
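The X10 kernel above is a multiplicative NMF update. A hedged pure-Python sketch of one such step (tiny dense matrices and invented helper names; the real code is distributed and sparse):

```python
def matmul(A, B):
    """Dense matrix product of nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def cellwise(A, B, op):
    """Element-wise combination of two same-shaped matrices."""
    return [[op(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def gnmf_step(V, W, H, eps=1e-9):
    """One multiplicative NMF update, mirroring the X10 kernel:
    H *= (W^T V) / (W^T W H);  W *= (V H^T) / (W H H^T)."""
    WtV = matmul(transpose(W), V)
    WtWH = matmul(matmul(transpose(W), W), H)
    H = cellwise(H, cellwise(WtV, WtWH, lambda a, b: a / (b + eps)),
                 lambda h, r: h * r)
    VHt = matmul(V, transpose(H))
    WHHt = matmul(W, matmul(H, transpose(H)))
    W = cellwise(W, cellwise(VHt, WHHt, lambda a, b: a / (b + eps)),
                 lambda w, r: w * r)
    return W, H
```

The multiplicative form keeps W and H non-negative and monotonically reduces the reconstruction error ||V - WH||², which is why no step size is needed.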
[Figure: distribution of V, W, H, and HH across places P0, P1, P2, …, Pn]
99
GNNMF Performance
[Chart: GNMF runtime (seconds, 0–60) vs. nonzeros in V (100–1000 million), for Native-MPI, Native-Socket, and Managed-Socket; 40 cores (8 places, 5 workers)]
100
Linear Regression
X10:
for (1..iteration) {
  d_p.sync();
  d_q.transMult(V, Vp.mult(V, d_p), d_q2, false);
  q.cellAdd(p1.scale(lambda));
  alpha = norm_r2 / pq.transMult(p, q)(0, 0);
  p.copyTo(p1);
  w.cellAdd(p1.scale(alpha));
  old_norm_r2 = norm_r2;
  r.cellAdd(q.scale(alpha));
  norm_r2 = r.norm(r);
  beta = norm_r2 / old_norm_r2;
  p.scale(beta).cellSub(r);
}

DML:
while (i < max_iteration) {
  q = ((t(V) %*% (V %*% p)) + eps * p);
  alpha = norm_r2 / castAsScalar(t(p) %*% q);
  w = w + alpha * p;
  old_norm_r2 = norm_r2;
  r = r + alpha * q;
  norm_r2 = sum(r * r);
  beta = norm_r2 / old_norm_r2;
  p = -r + beta * p;
  i = i + 1;
}
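The loop shown is conjugate gradient on the regularized normal equations, solving (VᵀV + eps·I)w = Vᵀy. A hedged pure-Python sketch (function name and the tiny dense matvec are illustrative; the real kernels are distributed GML operations):

```python
def cg_linreg(V, y, eps=0.01, iters=20):
    """Conjugate gradient mirroring the DML loop: solves
    (V^T V + eps*I) w = V^T y for ridge-regression weights w."""
    m, n = len(V), len(V[0])
    def matvec(p):                          # q = V^T (V p) + eps * p
        Vp = [sum(V[i][j] * p[j] for j in range(n)) for i in range(m)]
        return [sum(V[i][j] * Vp[i] for i in range(m)) + eps * p[j]
                for j in range(n)]
    b = [sum(V[i][j] * y[i] for i in range(m)) for j in range(n)]
    w = [0.0] * n
    r = [-bj for bj in b]                   # residual r = A w - b, w = 0
    p = [bj for bj in b]                    # search direction p = -r
    norm_r2 = sum(rj * rj for rj in r)
    for _ in range(iters):
        if norm_r2 < 1e-12:                 # converged: avoid 0/0 below
            break
        q = matvec(p)
        alpha = norm_r2 / sum(pj * qj for pj, qj in zip(p, q))
        w = [wj + alpha * pj for wj, pj in zip(w, p)]
        r = [rj + alpha * qj for rj, qj in zip(r, q)]
        old, norm_r2 = norm_r2, sum(rj * rj for rj in r)
        beta = norm_r2 / old
        p = [-rj + beta * pj for rj, pj in zip(r, p)]
    return w
```

Each iteration needs exactly one multiply by VᵀV + eps·I (the `d_q.transMult` / `q = t(V)%*%(V%*%p) + eps*p` step), which dominates the distributed cost.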
101
Linear regression
[Chart: Linear Regression runtime (seconds, 0–100) vs. nonzeros in V (1000–10000 million), for Native-MPI, Native-Socket, and Managed-Socket; 40 cores (8 places, 5 workers)]
102
Getting the GML Code
Entire GML library, example programs, and build/run scripts included in your tutorial workspace
Also available open source under EPL at x10-lang.org
  Currently only via svn (x10.gml project in x10 svn)
Packaged release and formal announcement coming soon
103
ANU Chem
104
ANUChem
Computational chemistry codes written in X10
– FMM: electrostatic calculation using the Fast Multipole Method
• Included in tutorial workspace
– PME: electrostatic calculation using the Smooth Particle Mesh Ewald method
– Pumja Rasaayani: energy calculation using Hartree-Fock SCF-method
Developed by Josh Milthorpe and Ganesh Venkateshwara at ANU
– http://cs.anu.edu.au/~Josh.Milthorpe/anuchem.html
– Open source (EPL); tracks X10 releases
– Around 16,000 lines of X10 code
– Overview and initial results in [Milthorpe et al IPDPS 2011]
• Code under active development; significant performance and scalability improvements in 2.2.1 release vs. results in the paper
105
Goals for Case Study
Deeper understanding of X10 Arrays and DistArrays
– Extending with application-specific behavior
– Using X10’s constrained types to enable static checking and optimization
– DistArray as an example of an efficient distributed data structure
  • Usage of PlaceLocalHandles
  • Usage of transient fields and caching
Concurrency/distribution in FMM; use of collecting finish
106
Array and DistArray

A Point is an element of an n-dimensional Cartesian space (n >= 1) with integer-valued coordinates, e.g., [5], [1, 2], …
A Region is an ordered set of points, all of the same rank
An Array maps every Point in its defining Region to a corresponding data element.
Key operations
creation
apply/set (read/write elements)
map, reduce, scan
iteration, enhanced for loops
copy
A Dist (distribution) maps every Point in its Region to a Place
A DistArray (distributed array) is defined over a Dist and maps every Point in its distribution to a corresponding data element.
Data elements in the DistArray may only be accessed in their home location (the place to which the Dist maps their Point)
Key operations similar to Array, but bulk operations (map, reduce, etc) are distributed operations
With the exception of Array literals ( [1, 2, 3] === new Array[int](3, (i:int)=>i+1) ), Array and DistArray are implemented completely as library classes in the x10.array package with no special compiler support.
107
DistArrays in Detail
We’ll examine the code of DistArray, Dist, BlockDist
– Local storage accessed via PlaceLocalHandle
– Use of transient fields to control serialization
– Indexing operations
  • Key idea: Dist.offset is an abstract method that subclasses must implement; it is responsible for mapping Points in the global index space to offsets in local storage
– Bulk operations (e.g. reduce) implemented internally using distributed collecting finish
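The Dist.offset idea, mapping a global index to a (place, local offset) pair, can be sketched for a 1-D block distribution. Names here are illustrative inventions, not the x10.array API:

```python
class BlockDist:
    """Sketch of a 1-D block distribution of indices 0..n-1 over P places.

    Analogue of the Dist.offset contract: place() gives the home place of
    a global index, offset() its position in that place's local storage.
    """
    def __init__(self, n, P):
        self.n, self.P = n, P
        self.block = (n + P - 1) // P       # ceil(n / P) elements per place
    def place(self, i):
        return i // self.block              # home place of global index i
    def offset(self, i):
        return i % self.block               # offset into local storage
```

Each global index maps to exactly one (place, offset) pair, so local storage per place can be a plain contiguous array.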
108
Extending DistArray: MortonDist
MortonDist (3D) used in FMM to define a DistArray that distributes octree boxes (key data structure) to X10 places
See au.edu.anu.mm.MortonDist.x10
[Left: ordering of Points in a 2D MortonDist of 0..7*0..7.  Right: a 2D MortonDist of 0..7*0..7 over 5 Places]
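Morton (Z-order) indexing interleaves the bits of the coordinates so that nearby points tend to get nearby indices, which is what makes it attractive for distributing octree boxes. A 2D illustration in Python (the 3D MortonDist in FMM interleaves three coordinates; function name and bit convention here are illustrative):

```python
def morton2d(x, y, bits=3):
    """Interleave the bits of (x, y) into a Z-order (Morton) index.

    x occupies the even bit positions, y the odd ones, so consecutive
    indices trace the characteristic 'Z' pattern through the grid.
    """
    m = 0
    for b in range(bits):
        m |= ((x >> b) & 1) << (2 * b)
        m |= ((y >> b) & 1) << (2 * b + 1)
    return m
```

Over an 8x8 grid this is a bijection onto 0..63, so sorting boxes by Morton index and cutting the sequence into segments yields the block-per-place distribution suggested by the figure.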
109
Extending DistArray: PeriodicDist
Enables modular (“wrap-around”) indexing at each Place
Any Dist can be made periodic by encapsulating it in a PeriodicDist. Can then be used in DistArray to get modular indexing
See x10.array.PeriodicDist
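The wrap-around behavior can be shown with a tiny analogue in Python (a hypothetical helper, not the x10.array API):

```python
class PeriodicArray:
    """Sketch of PeriodicDist-style modular indexing on a plain list:
    every integer index, including negatives, wraps into 0..n-1."""
    def __init__(self, data):
        self.data = data
    def __getitem__(self, i):
        # Python's % already yields a non-negative result for negatives
        return self.data[i % len(self.data)]
```

This is the indexing convention stencil codes with periodic boundary conditions need: index -1 refers to the last element and index n to the first.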
110
Arrays in Detail
We’ll examine the implementation of x10.array.Array
– Array indexing
  • Specialized implementations based on properties.
  • Optimized for the common case of Regions of rank < 5 that are zero-based and rectangular (no holes in the index space).
  • General case fully supported with minimal time overhead on indexing (space usage based on the bounding box of the Region).
  • Constructor does computations to optimize later indexing.
  • If bounds checks are disabled, Native X10 indexing performance is identical to an equivalent C array.
– Array iteration
  • Compiler optimizes iteration over rectangular Regions into counted for loops.
  • Non-rectangular Regions fully supported, but use the Point-based Region iterator API (significant performance overhead).
111
FMM: ExpansionRegion
In FMM, a specialized region is used to represent the triangular-shaped region of multipole expansions
Enables error-checking and cleaner code when relevant Arrays defined over ExpansionRegion
Indexing performance is good, but in a few key loop nests “hand-inlining” of the ExpansionRegion iterator was required for maximal performance. This will eventually be handled by X10 compiler improvements (well-known optimizations suffice, but are not yet implemented).
See au.edu.anu.mm.ExpansionRegion and Expansion
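For intuition, a triangular region of multipole coefficients indexed by (l, m) with -l <= m <= l can be packed densely into a flat array. This formula is illustrative of the kind of mapping such a region needs (it is an assumption about the layout; see ExpansionRegion for the actual code):

```python
def expansion_offset(l, m):
    """Dense offset for a multipole coefficient (l, m), -l <= m <= l.

    Row l of the triangle has 2l+1 entries, and rows 0..l-1 contribute
    sum(2k+1) = l*l entries before it, so the offset is l*l + l + m.
    """
    assert -l <= m <= l
    return l * l + l + m
```

A region with up to p terms therefore packs into exactly (p+1)² slots with no holes, which is what makes dense storage and fast indexing possible for a non-rectangular region.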
112
FMM: Nested Collecting Finish
The FMM “downward pass” is implemented using an elegant pattern of nested distributed collecting finishes– Clean, easy to read code– Good scalability (tested to 512 places on BG/P)
Code to examine– au.edu.anu.mm.Fmm3d.downwardPass– au.edu.anu.mm.FmmBox.downward– au.edu.anu.mm.FmmLeafBox.downward
113
X10 Tutorial Outline
X10 Overview
X10 on GPUs
X10 Tools
Lab Session I
LUNCH
Case Studies
Summary
Lab Session II
114
Programmer Resources
x10-lang.org
– Language specification
– Online tutorials
– X10-related classes and research papers
– Performance Tuning Tips
– FAQ
• Sample X10 programs
– Small examples in X10 distribution and tutorials
– X10 Benchmarks: optional download with X10 release
– X10 Global Matrix Library
– Other Applications
  • ANUChem: http://cs.anu.edu.au/~Josh.Milthorpe/anuchem.html
  • Raytracer and OpenGL KMeans programs
  • Your application here…
115
X10 Project Road Map
Upcoming Releases
– X10 2.2.2, mid-December 2011
  • Bug fixes and small improvements
– X10 2.3.0, first half of 2012
  • Java interoperability
  • Mixed-mode execution (single program running in a mix of Native and Managed places)
  • Will be backwards compatible with X10 2.2.x, but may include language enhancements for determinacy
116
X10 Community
Join the X10 community at x10-lang.org!
– x10-users mailing list
– List of upcoming events, papers, classes, users, etc.
X10/X10DT releases: http://x10-lang.org/software/download-x10
Try X10! We want to hear your experiences and feedback!
Very interested in publicizing X10 applications and benchmarks that can be shared with the community.
117
Questions and Discussion
Questions?
Discussion?
Next up: Lab Session II…
118
Lab Session II
Objectives
– Work on a more substantial X10 coding project of interest to you
– Some suggested projects
  • Explore Global Matrix Library sample programs; write some new code using GML
  • Try writing a new client of GlobalLoadBalancing
  • Try to parallelize (and then distribute) your favorite recursive divide-and-conquer algorithm
  • Explore using X10DT by modifying sample programs
  • Try coding up your favorite problem in X10…