

Page 1 (source: x10.sourceforge.net/tutorials/x10-2.2/SC_2011/SC11_X10...)

1

Developing Scalable Parallel Applications in X10

David Cunningham, David Grove,Vijay Saraswat, and Olivier Tardieu

IBM Research

Based on material from previous X10 Tutorials by Bard Bloom, Evelyn Duesterwald Steve Fink, Nate Nystrom, Igor Peshansky,

Christoph von Praun, and Vivek SarkarAdditional contributions from Anju Kambadur, Josh Milthorpe and Juemin Zhang

This material is based upon work supported in part by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.

Please see x10-lang.org for the most up-to-date version of these slides and sample programs.


2

Tutorial Objectives

Learn to develop scalable parallel applications using X10
– X10 language and class library fundamentals
– Effective patterns of concurrency and distribution
– Case studies: draw material from real working code
– Lab sessions: learn by doing

Exposure to the Asynchronous PGAS programming model
– Concepts broader than any single language

Updates on the X10 project and community
– Highlight what is new in X10 since SC'10
– Gather feedback from you about X10


3

X10 Tutorial Outline

X10 Overview
X10 Fundamentals (~60 minutes)
X10 Workstealing (~20 minutes)

X10 on GPUs (~40 minutes)

X10 Tools (~20 minutes)

Lab Session I

LUNCH

Four Case Studies (approximately 30 minutes each)
LU Factorization
Global Load Balancing Framework
Global Matrix Library
Selections from ANUChem

Summary (~15 minutes)

Lab Session II (~45 minutes)


4

X10: Performance and Productivity at Scale

An evolution of Java for concurrency, scale out, and heterogeneity

The X10 language provides:
– Ability to specify fine-grained concurrency

– Ability to distribute computation across large-scale clusters

– Ability to represent heterogeneity at the language level

– Simple programming model for computation offload

– Modern OO language features (build libraries/frameworks)

– Support for writing deadlock-free and determinate code

– Interoperability with Java

– Interoperability with native libraries (e.g., BLAS)


5

X10 Target Environments

High-end large clusters
– BlueGene/P
– Power7IH
– x86 + IB, Power + IB
– Goal: deliver scalable performance competitive with C+MPI

Medium-scale commodity systems
– ~100 nodes (~1000 cores and ~1 terabyte main memory)
– Scale-out environments, but MTBF is days, not minutes
– Programs that run in minutes/hours at this scale
– Goal: deliver main-memory performance with a simple programming model (accessible to Java programmers)

Developer laptops
– Linux, Mac, Windows; Eclipse-based IDE, debugger, etc.
– Goal: support developer productivity


6

X10 Programming Model

Asynchronous PGAS (Partitioned Global Address Space)
– Global address space, but with explicit locality
– Activities can be dynamically created and joined

Abstracts hardware details, while still allowing the programmer to express locality and concurrency.

[Figure: places, each an address space containing parallel activities. Legend: thread/process, address space, memory reference, parallel activities.]


7

X10 APGAS Constructs

Fine-grained concurrency
• async S

Atomicity
• atomic S
• when (c) S

Place-shifting operations
• at (p) S

Ordering
• finish S
• clocked, next

Two central ideas: Places and Asynchrony

Language Support for Replication of Values Across Places
• Distributed object model: replicate by default, GlobalRef
• Global data structures: DistArray


8

Hello Whole World

import x10.io.Console;

class HelloWholeWorld {
  public static def main(Array[String]) {
    finish for (p in Place.places()) {
      async at (p)
        Console.OUT.println("Hello World from place " + p.id);
    }
  }
}

(%1) x10c++ -o HelloWholeWorld -O HelloWholeWorld.x10
(%2) runx10 -n 4 HelloWholeWorld
Hello World from place 0
Hello World from place 2
Hello World from place 3
Hello World from place 1
(%3)


9

Sequential MontyPi

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val r = new Random();
    var result:Double = 0;
    for (1..N) {
      val x = r.nextDouble();
      val y = r.nextDouble();
      if (x*x + y*y <= 1) result++;
    }
    val pi = 4*result/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}


10

Concurrent MontyPi

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val P = Int.parse(args(1));
    val result = new Cell[Double](0);
    finish for (1..P) async {
      val r = new Random();
      var myResult:Double = 0;
      for (1..(N/P)) {
        val x = r.nextDouble();
        val y = r.nextDouble();
        if (x*x + y*y <= 1) myResult++;
      }
      atomic result() += myResult;
    }
    val pi = 4*(result())/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}
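A rough Java analogue of the concurrent version above may help readers who know Java but not X10: `finish for (1..P) async` becomes a joined thread per worker, and the `atomic` accumulation becomes a `DoubleAdder`. The class and method names here are illustrative, not part of X10.

```java
import java.util.Random;
import java.util.concurrent.atomic.DoubleAdder;

// Sketch of the Concurrent MontyPi slide in plain Java threads.
public class MontyPiThreads {
    public static double estimatePi(int n, int p) throws InterruptedException {
        DoubleAdder result = new DoubleAdder();   // plays the role of the Cell + atomic
        Thread[] workers = new Thread[p];
        for (int w = 0; w < p; w++) {
            workers[w] = new Thread(() -> {       // one "async" per worker
                Random r = new Random();
                double my = 0;
                for (int i = 0; i < n / p; i++) {
                    double x = r.nextDouble(), y = r.nextDouble();
                    if (x * x + y * y <= 1) my++;
                }
                result.add(my);                   // the "atomic result() += myResult"
            });
            workers[w].start();
        }
        for (Thread t : workers) t.join();        // the "finish"
        return 4 * result.sum() / n;
    }
}
```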


11

Distributed MontyPi

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val result = GlobalRef[Cell[Double]](new Cell[Double](0));
    finish for (p in Place.places()) at (p) async {
      val r = new Random();
      var myResult:Double = 0;
      for (1..(N/Place.MAX_PLACES)) {
        val x = r.nextDouble();
        val y = r.nextDouble();
        if (x*x + y*y <= 1) myResult++;
      }
      val ans = myResult;
      at (result.home) atomic result()() += ans;
    }
    val pi = 4*(result()())/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}


12

X10 Project

X10 is an open source project (http://x10-lang.org)
– Released under Eclipse Public License (EPL) v1
– Current Release: 2.2.1 (September 2011)

X10 Implementations
– C++ based ("Native X10")
  • Multi-process (one place per process; multi-node)
  • Linux, AIX, MacOS, Cygwin, BlueGene/P
  • x86, x86_64, PowerPC
– JVM based ("Managed X10")
  • Multi-process (one place per JVM process; multi-node)
    – Limitation on Windows to single process (single place)
  • Runs on any Java 5/Java 6 JVM

X10DT (X10 IDE) available for Windows, Linux, Mac OS X
– Based on Eclipse 3.6
– Supports many core development tasks, including remote build/execute facilities


13

X10 Compilation Flow

[Diagram: X10 Source → Parsing / Type Check → X10 AST → AST Optimizations / AST Lowering → X10 AST, then one of two back-ends. C++ Code Generation → C++ Source → C++ Compiler → Native Code ("Native X10", XRC/XRX runtime, X10RT, native environments). Java Code Generation → Java Source → Java Compiler → Bytecode ("Managed X10", XRJ/XRX runtime, Java VMs, JNI). The X10 compiler front-end is shared; the C++ and Java back-ends diverge after lowering.]


14

Overview of Language Features

Many sequential features of Java inherited unchanged:
– Classes (w/ single inheritance)
– Interfaces (w/ multiple inheritance)
– Instance and static fields
– Constructors, (static) initializers
– Overloaded, over-rideable methods
– Garbage collection
– Familiar control structures, etc.

– Structs
– Closures
– Multi-dimensional & distributed arrays

Substantial extensions to the type system:
– Dependent types
– Generic types
– Function types
– Type definitions, inference

Concurrency:
– Fine-grained concurrency: async S;
– Atomicity and synchronization: atomic S; when (c) S;
– Ordering: finish S;

Distribution:
– GlobalRef[T]
– at (p) S;


15

Classes

– Single inheritance, multiple interfaces
– Static and instance methods
– Static val fields
– Instance fields may be val or var
– Values of class types may be null
– Heap allocated


16

Structs

User-defined primitives:
– No inheritance
– May implement interfaces
– All fields are val
– All methods are final
– Allocated "inline" in containing object, array, or variable
– Headerless

struct Complex {
  val real:Double;
  val img :Double;
  def this(r:Double, i:Double) {
    real = r; img = i;
  }
  def operator + (that:Complex) {
    return Complex(real + that.real, img + that.img);
  }
  ...
}

val x = new Array[Complex](1..10);
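For Java readers, a record gives roughly the same "all fields are val, no inheritance" flavor as an X10 struct (though records are still heap-allocated with headers, unlike structs). This sketch is an illustration, not X10:

```java
// Java analogue of the Complex struct: a record is immutable and final;
// the X10 "operator +" becomes an ordinary plus() method.
record Complex(double real, double img) {
    Complex plus(Complex that) {
        return new Complex(real + that.real, img + that.img);
    }
}
```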


17

Function Types

(T1, T2, ..., Tn) => U
  the type of functions that take arguments Ti and return U

If f: (T) => U and x: T,
then invoke with f(x): U

Function types can be used as an interface:
define an apply operator with the appropriate signature:
operator this(x:T): U

Closures: first-class functions
(x: T): U => e

Used in array initializers:
new Array[int](5, (i:int) => i*i )
yields the array [ 0, 1, 4, 9, 16 ]
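The initializer pattern above (a function from index to element) can be sketched in Java with a functional interface. The helper name `makeArray` is illustrative:

```java
import java.util.function.IntUnaryOperator;

// Sketch of new Array[int](5, (i:int) => i*i): the closure computes
// each element from its index.
public class ClosureDemo {
    public static int[] makeArray(int n, IntUnaryOperator init) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = init.applyAsInt(i);
        return a;
    }
}
```

`makeArray(5, i -> i * i)` yields `[0, 1, 4, 9, 16]`, matching the slide.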


18

Generic Types

Classes, structs, and interfaces may have type parameters:
class ArrayList[T]
defines a type constructor ArrayList and a family of types:
ArrayList[Int], ArrayList[String], ArrayList[C], ...

ArrayList[C]: as if the ArrayList class were copied and C substituted for T

Can instantiate on any type, including primitives (e.g., Int)

public final class ArrayList[T] {
  private var data:Array[T];
  private var ptr:int = 0;

  public def this(n:int) {
    data = new Array[T](n);
  }
  ...
  public def addLast(elem:T) {
    if (ptr == data.size) { ...grow data... }
    data(ptr++) = elem;
  }

  public def get(idx:int) {
    if (idx < 0 || idx >= ptr) { throw ... }
    return data(idx);
  }
}


19

Dependent Types

Classes and structs may have properties (public val instance fields):
class Region(rank:Int, zeroBased:Boolean, rect:Boolean) { ... }

Can constrain properties with a boolean expression:
Region{rank==3}
  the type of all regions with rank 3
Array[Int]{region==R}
  the type of all arrays defined over region R
  (R must be a constant or a final variable in scope at the type)

Dependent types are checked statically.

Dependent types used to statically check locality properties

Dependent types used heavily in Array library to check invariants

Dependent type system is extensible

Dependent types enable optimization (compile-time constant propagation of properties)


20

Type Inference

Field and local variable types are inferred from the initializer type:
val x = 1;
  x has type int{self==1}
val y = (1..2)*(1..10);
  y has type Region{rank==2,rect}

Method return types are inferred from the method body:
def m() { ... return true ... return false ... }
  m has return type Boolean

Loop index types are inferred from the region:
R: Region{rank==2}
for (p in R) { ... }
  p has type Point{rank==2}

Programming Tip: it is often important not to declare the types of local vals; this enables type inference to compute the most precise possible type.


21

async

async S

Creates a new child activity that executes statement S

Returns immediately

S may reference variables in enclosing blocks

Activities cannot be named

Activity cannot be aborted or cancelled

Stmt ::= async Stmt

cf Cilk’s spawn

// Compute the Fibonacci
// sequence in parallel.
def fib(n:Int):Int {
  if (n < 2) return 1;
  val f1:Int;
  val f2:Int;
  finish {
    async f1 = fib(n-1);
    f2 = fib(n-2);
  }
  return f1+f2;
}


22

// Compute the Fibonacci
// sequence in parallel.
def fib(n:Int):Int {
  if (n < 2) return 1;
  val f1:Int;
  val f2:Int;
  finish {
    async f1 = fib(n-1);
    f2 = fib(n-2);
  }
  return f1+f2;
}

finish

finish S

Execute S, but wait until all (transitively) spawned asyncs have terminated.

Rooted exception model

Trap all exceptions thrown by spawned activities.

Throw an (aggregate) exception if any spawned async terminates abruptly.

There is an implicit finish around the main activity.

finish is useful for expressing “synchronous” operations on (local or) remote data.

Stmt ::= finish Stmt

cf Cilk’s sync
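Since the slides compare async/finish to Cilk's spawn/sync, Java's fork/join framework offers a third point of comparison: `fork()` plays the role of async and `join()` the role of finish. This sketch follows the slide's convention that fib(0) = fib(1) = 1:

```java
import java.util.concurrent.RecursiveTask;

// Java fork/join analogue of the fib slide.
class Fib extends RecursiveTask<Integer> {
    final int n;
    Fib(int n) { this.n = n; }
    protected Integer compute() {
        if (n < 2) return 1;
        Fib f1 = new Fib(n - 1);
        f1.fork();                        // "async f1 = fib(n-1)"
        int f2 = new Fib(n - 2).compute(); // "f2 = fib(n-2)" runs inline
        return f1.join() + f2;            // the "finish" before return
    }
}
```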


23

// push data onto concurrent
// list-stack
val node = new Node(data);
atomic {
  node.next = head;
  head = node;
}

atomic

atomic S

Execute statement S atomically

Atomic blocks are conceptually executed in a serialized order with respect to all other atomic blocks: isolation and weak atomicity.

An atomic block body (S) ...
– must be nonblocking
– must not create concurrent activities (sequential)
– must not access remote data (local)
Restrictions are checked dynamically.

// target defined in lexically
// enclosing scope.
atomic def CAS(old:Object, n:Object) {
  if (target.equals(old)) {
    target = n;
    return true;
  }
  return false;
}

Stmt ::= atomic Stmt
MethodModifier ::= atomic
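The CAS example above has a close Java counterpart in `AtomicReference.compareAndSet`, which performs the same check-then-swap atomically. One difference worth noting: `compareAndSet` uses reference equality, whereas the slide's version calls `equals`. The class name below is illustrative:

```java
import java.util.concurrent.atomic.AtomicReference;

// Java sketch of the slide's atomic CAS method.
// Note: compareAndSet compares by reference (==), not equals().
class Target {
    private final AtomicReference<Object> target = new AtomicReference<>();
    boolean cas(Object old, Object n) {
        return target.compareAndSet(old, n);
    }
    Object get() { return target.get(); }
}
```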


24

when

when (E) S

The activity suspends until a state in which the guard E is true.

In that state, S is executed atomically and in isolation.

Guard E is a boolean expression that
– must be nonblocking
– must not create concurrent activities (sequential)
– must not access remote data (local)
– must not have side-effects (const)

Stmt ::= WhenStmt
WhenStmt ::= when ( Expr ) Stmt
           | WhenStmt or ( Expr ) Stmt

class OneBuffer {
  var datum:Object = null;
  var filled:Boolean = false;
  def send(v:Object) {
    when ( !filled ) {
      datum = v;
      filled = true;
    }
  }
  def receive():Object {
    when ( filled ) {
      val v = datum;
      datum = null;
      filled = false;
      return v;
    }
  }
}
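The guarded-statement pattern `when (c) S` corresponds closely to the classic Java monitor idiom: a synchronized block that waits in a loop until the guard holds. Here is the OneBuffer example sketched that way:

```java
// Java monitor analogue of the OneBuffer slide: "when (c) { S }" becomes
// "synchronized { while (!c) wait(); S; notifyAll(); }".
class OneBuffer {
    private Object datum = null;
    private boolean filled = false;

    synchronized void send(Object v) throws InterruptedException {
        while (filled) wait();      // when ( !filled )
        datum = v;
        filled = true;
        notifyAll();
    }

    synchronized Object receive() throws InterruptedException {
        while (!filled) wait();     // when ( filled )
        Object v = datum;
        datum = null;
        filled = false;
        notifyAll();
        return v;
    }
}
```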


25

at

at (p) S

Execute statement S at place p

Current activity is blocked until S completes

Deep copy of local object graph to target place; the variables referenced from S define the roots of the graph to be copied.

GlobalRef[T] used to create remote pointers across at boundaries

Stmt ::= at (p) Stmt

class C {
  var x:int;
  def this(n:int) { x = n; }
}

// Increment remote counter
def inc(c:GlobalRef[C]) {
  at (c.home) c().x++;
}

// Create GR of C
static def make(init:int) {
  val c = new C(init);
  return GlobalRef[C](c);
}


26

Distributed Object Model

Objects live in a single place

Objects can only be accessed in the place where they live

Object graph reachable from the statements in an “at” will be deeply copied from the source to destination place and a new isomorphic graph created

Instance fields of an object may be declared transient to break the deep copy

GlobalRef[T] enables cross-place pointers to be created

GlobalRef can be freely copied, but referent can only be accessed via the apply at its home place

public struct GlobalRef[T](home:Place){T<:Object} {
  public native def this(t:T):GlobalRef{self.home==here};
  public native operator this(){here == this.home}:T;
}


27

Summary

A whirlwind tour through the X10 language.
Additional details will be introduced via case studies.

See http://x10-lang.org for specification, samples, etc.

Next topic: alternate implementations of async/finish in the X10 runtime and their implications for programmers

Questions so far?


28

X10 Runtime

One process per place

• choice of inter-process communication library (x10rt)
• standalone (shared mem), sockets (tcp/ip), mpi, PAMI

Pool of worker threads per place

• runtime scheduler maps asyncs to threads (many to 1)

• choice of two schedulers• fork-join scheduler (default, ~ Java Fork-Join)• work-stealing scheduler (work in progress, ~ Cilk)

• to use it, compile with "x10c++ -WORK-STEALING"


29

X10 Schedulers

• each worker maintains a queue of pending jobs
• async p; q…

• FJ: worker pushes p, runs q… (job = async)

• WS: worker pushes q…, runs p (job = continuation)

• idle worker
  • pops a job from its queue if any
  • steals a job from a random other worker if empty (spin)

• queues are double-ended
  • push and pop from one end, steal from the other end
  => no contention for statically balanced workloads
  • less contention if queues are non-empty
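The deque discipline described above can be sketched in a few lines: the owning worker pushes and pops at one end (LIFO), while a thief steals from the opposite end (FIFO). Real schedulers use lock-free deques; the `ArrayDeque` here is purely illustrative of the access pattern, and the class name is hypothetical:

```java
import java.util.ArrayDeque;

// Illustrative (non-thread-safe) model of a work-stealing deque:
// owner operates on one end, thieves on the other.
class WorkQueue<T> {
    private final ArrayDeque<T> deque = new ArrayDeque<>();
    void push(T job) { deque.addLast(job); }    // owner end
    T pop()          { return deque.pollLast(); }  // owner: most recent job
    T steal()        { return deque.pollFirst(); } // thief: oldest job, other end
}
```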


30

Blocking Constructs

• finish typically does not block the worker
  • WS: run all subtasks first
  • FJ: run a subtask in the queue instead of blocking

• finish may block because of thieves or remote asyncs
  • worker 1 reaches the end of a finish block while worker 2 is still running an async

• context switch (blocking finish, when)
  • FJ: OS-level context switch
    • worker thread suspends, another worker thread is awakened
  • WS: runtime-level context switch
    • runtime saves the worker's state, assigns a new job to the worker


31

Cost Analysis

• async
  • queue operations (fences, at worst CAS)
• finish (single place)
  • synchronized counter (FJ: always, WS: if >1 worker)
• FJ relies on helper objects for finish and async
  • overhead of object creation + heap allocation + GC
• WS relies on compiler transformation (CPS)
  • sequential overhead (ex: 1.3x for QuickSort)
  • steal overhead (~ context switch)


32

Pros and Cons

• WS
  + much more scalable with fine-grained concurrency
  + fixed-size thread pool
  - sequential overhead
• FJ
  - requires coarser tasks
  - makes load balancing harder
  + more "obvious" performance model
  + more mature implementation (for now)


33

X10 Tutorial Outline

X10 Overview

X10 on GPUs
CUDA via APGAS

K-Means example & demo

X10 Tools

Lab Session I

LUNCH

Case Studies

Summary

Lab Session II


34

Why Program GPUs with X10?

Why program the GPU at all?

Many times faster for certain classes of applications

Hardware is cheap and widely available (as opposed to e.g. FPGA)

Why write X10 (instead of CUDA/OpenCL)?

Use the same programming model (APGAS) on GPU and CPU

Easier to write parallel/distributed GPU-aware programs

Better type safety

Higher level abstractions => fewer lines of code


35

X10/GPU Programming Experience

GPU programming will never be as easy as CPU. We try to make it as easy as possible.

Correctness
– Can debug X10 kernel code on CPU first, using standard techniques
– Static errors avoid certain classes of faults
– Goal: eliminate all GPU segfaults with negligible overhead
– Currently detect all dereferences of non-local data (place errors)
– TODO: static array bounds checking
– TODO: static null-pointer checking

Performance: need an understanding of the CUDA performance model
– Must know the advantages and limitations of registers, SHM, and global memory
– Avoid warp divergence
– Avoid irregular/misaligned memory access
– Use the CUDA profiling tool to debug kernel performance (very easy and usable)
– Can inspect and disassemble the generated cubin file
– Easier to tune blocks/threads using auto-configuration


36

Why target CUDA and not OpenCL?

In early 2009, OpenCL support was limited

CUDA is based on C++

OpenCL based on C

C++ features help us implement X10 features (e.g. generics)

By targeting CUDA we can re-use parts of existing C++ backend

CUDA evolving fast, more interesting for high level languages (e.g. OOP)

However there are advantages of OpenCL too...

We could support it in future


37

APGAS model ↔ GPU model

– at ↔ GPU / host divide
– async ↔ blocks, threads
– finish ↔ kernel invocation
– clocks ↔ __syncthreads()
– Java-like sequential constructs ↔ kernel code
– memory: shared mem, constant mem, local mem, DMAs


38

GPU / Host divide

[Figure: hosts connected by a high-performance network; each host connected over the PCI bus to one or more CUDA GPUs; each GPU is a 'place' in the APGAS model.]

Goals
– Re-use existing language concepts wherever possible
– Independent GPU memory space ⇒ new place
– Place count and topology unknown until run-time
– Same X10 code works on different run-time configurations
– Could use the same approach for OpenCL, Cell, FPGA, …


39

X10/CUDA code (mass sqrt example)

for (host in Place.places()) at (host) {
  val init = new Array[Float](1000, (i:Int)=>i as Float);
  val recv = new Array[Float](1000);
  for (accel in here.children().values()) {          // discover GPUs
    val remote = CUDAUtilities.makeRemoteArray[Float](accel, 1000, (Int)=>0.0f); // alloc on GPU
    val num_blocks = 8, num_threads = 64;
    finish async at (accel) @CUDA {                  // GPU code; implicit capture and transfer to GPU
      finish for ([block] in 0..(num_blocks-1)) async {
        clocked finish for ([thread] in 0..(num_threads-1)) clocked async {
          val tid = block*num_threads + thread;
          val tids = num_blocks*num_threads;
          for (var i:Int=tid ; i<1000 ; i+=tids) {
            remote(i) = Math.sqrtf(init(i));
          }
        }
      }
    }
    // Console.OUT.println(remote(42));              // static type error
    finish Array.asyncCopy(remote, 0, recv, 0, recv.size); // copy result to host
    Console.OUT.println(recv(42));
  }
}


40

CUDA threads as APGAS activities

__global__ void kernel(...) {
  ...
}

kernel<<<num_blocks, num_threads>>>(...);


41

CUDA threads as APGAS activities

finish async at (p) @CUDA {                       // CUDA kernel
  finish for ([block] in 0..(blocks-1)) async {   // CUDA block
    clocked finish for ([thread] in 0..(threads-1)) clocked async {  // CUDA thread
      val tid = block*threads + thread;
      val tids = blocks*threads;
      for (var i:Int=tid ; i<len ; i+=tids) {
        remote(i) = Math.sqrtf(init(i));
      }
    }
  }
}

The @CUDA annotation enforces language restrictions:
– Define the kernel shape
– Only sequential code inside a thread
– Only primitive types / arrays
– No allocation, no dynamic dispatch


42

Dynamic Shared Memory & barrier

finish async at (p) @CUDA {
  finish for (block in 0..(blocks-1)) async {
    val shm = new Array[Int](threads, (Int)=>0);  // compiled to CUDA 'shared' memory,
                                                  // initialised on GPU (once per block)
    clocked finish for (thread in 0..(threads-1)) clocked async {
      shm(thread) = f(thread);
      Clock.advanceAll();                         // CUDA barrier represented with clock op
      val tmp = shm((thread+1) % threads);
    }
  }
}

General philosophy: use existing X10 constructs to express CUDA semantics; the APGAS model is sufficient.
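The write-barrier-read pattern above (each thread writes its slot, synchronizes, then reads a neighbour's slot) can be mirrored on the CPU with Java's `CyclicBarrier`, which plays the role of `Clock.advanceAll()` / `__syncthreads()`. The class name and the `thread * 10` stand-in for `f(thread)` are illustrative:

```java
import java.util.concurrent.CyclicBarrier;

// CPU sketch of one "block": threads fill a shared array, hit a barrier,
// then each reads its neighbour's slot.
public class BlockDemo {
    public static int[] run(int threads) throws Exception {
        int[] shm = new int[threads];
        int[] out = new int[threads];
        CyclicBarrier barrier = new CyclicBarrier(threads);
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int thread = i;
            ts[i] = new Thread(() -> {
                try {
                    shm[thread] = thread * 10;              // shm(thread) = f(thread)
                    barrier.await();                        // Clock.advanceAll()
                    out[thread] = shm[(thread + 1) % threads];
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return out;
    }
}
```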


43

Constant cache memory

val arr = new Array[Int](threads, (Int)=>0);

finish async at (p) @CUDA {
  val cmem : Sequence[Int] = arr.sequence();  // compiled to CUDA 'constant' cache;
                                              // automatically uploaded to GPU
                                              // (just before kernel invocation)
  finish for ([block] in 0..(blocks-1)) async {
    clocked finish for ([thread] in 0..(threads-1)) clocked async {
      val tmp = cmem(42);
    }
  }
}

('Sequence' is an immutable array.)

General philosophy: quite similar to 'shared' memory; again, the APGAS model is sufficient.


44

Blocks/Threads auto-config

finish async at (p) @CUDA {
  val blocks = CUDAUtilities.autoBlocks(),
      threads = CUDAUtilities.autoThreads();  // maximise 'occupancy' at run-time
                                              // according to a simple heuristic
  finish for ([block] in 0..(blocks-1)) async {
    clocked finish for ([thread] in 0..(threads-1)) clocked async {
      // kernel cannot assume any particular
      // values for 'blocks' and 'threads'
    }
  }
}

If called on the CPU, autoBlocks() == 1 and autoThreads() == 8.


45

KMeansCUDA kernel

finish async at (gpu) @CUDA {
  val blocks = CUDAUtilities.autoBlocks(),
      threads = CUDAUtilities.autoThreads();
  finish for ([block] in 0..(blocks-1)) async {
    val clustercache = new Array[Float](num_clusters*dims, clusters_copy); // cache clusters in SHM;
                                                                           // inputs dragged from host each iteration
    clocked finish for ([thread] in 0..(threads-1)) clocked async {
      val tid = block * threads + thread;
      val tids = blocks * threads;
      for (var p:Int=tid ; p<num_local_points ; p+=tids) {   // strip-mine outer loop
        var closest:Int = -1;
        var closest_dist:Float = Float.MAX_VALUE;
        @Unroll(20) for ([k] in 0..(num_clusters-1)) {       // unroll this loop
          var dist : Float = 0;
          for ([d] in 0..(dims-1)) {
            // gpu_points in SOA form to allow regular accesses, padded for
            // alignment (stride includes padding); allocated earlier via CUDAUtilities
            val tmp = gpu_points(p+d*num_local_points_stride) - clustercache(k*dims+d);
            dist += tmp * tmp;
          }
          if (dist < closest_dist) {
            closest_dist = dist;
            closest = k;
          }
        }
        gpu_nearest(p) = closest;
      }
    }
  }
}

The kernel above is only part of the story:
– Use the kernel once per iteration of Lloyd's algorithm
– Copy gpu_nearest to the CPU when done
– The rest of the algorithm runs faster on the CPU!
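Stripped of the GPU layout concerns, the kernel's core is a nearest-centroid search per point. A plain Java sketch of that inner logic (row-per-cluster layout rather than the kernel's padded SOA form; names are illustrative):

```java
// Nearest-centroid assignment for one point, as in the kernel's inner loops.
public class KMeansCore {
    public static int nearest(float[] point, float[][] clusters) {
        int closest = -1;
        float closestDist = Float.MAX_VALUE;
        for (int k = 0; k < clusters.length; k++) {
            float dist = 0;
            for (int d = 0; d < point.length; d++) {  // squared Euclidean distance
                float tmp = point[d] - clusters[k][d];
                dist += tmp * tmp;
            }
            if (dist < closestDist) { closestDist = dist; closest = k; }
        }
        return closest;
    }
}
```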


46

K-Means graphical demo

Map of USA (can scroll / zoom)

300 Million points (2 dimensional space)

Data auto-generated from zip-code population dataTown pop. normally distributed around town centers

Use K-Means to find 50 Centroids for USA populationPossible use case: Delivery distribution warehouses

Visualization uses GL
Runs on the GPU

Completely independent from CUDA

GPUs can be used for graphics too! :)

Can compare K-Means performance on CPU and GPU


K-Means End-to-End Performance

[Bar chart: x-axis configurations 2M/100, 2M/400, 4M/100, 4M/400 (points / K); y-axis 0–3 (normalized Gflops); series 1x1, 1x2, 2x1, 2x2 (Hosts x GPUs per host)]

Colours show scaling up to 4 Tesla GPUs
Y-axis gives Gflops normalized WRT native CUDA implementation (single GPU, host)

2M/4M points, K=100/400, 4 dimensions
Higher K ⇒ more GPU work ⇒ better scaling

Analysis

Performance sensitive to many factors:
CPU code: fragile g++ optimisations
X10/CUDA DMA bandwidth (GPUs share PCI bus)
Infiniband for host<->host communication (also shares PCI bus)
nvcc register allocation fragile

Outstanding DMA issue (~40% bandwidth loss)
CUDA requires X10 CPU objects allocated with cudaMalloc
Hard to integrate with existing libraries / GC
Currently staging DMAs through pre-allocated buffer
...complaints by users on CUDA forums

Options for further improving performance:
Use multicore for CPU parts
Hide DMAs

Hosts x GPUs (per host)


48

Future Work

Support texture cache...

Explicit allocation:
val t = Texture.make(...)
at (gpu) { … t(x, y) … }

Fermi architecture presents obvious opportunities for supporting X10
Indirect branches ⇒ object orientation and exceptions
NVidia's goal is the same as ours – allow high-level GPU programming

OpenCL
No conceptual hurdles
Different code-gen and runtime work
X10 programming experience should be very similar

Memory Allocation on GPU
Latest CUDA has a 'malloc' call we might use

GPU GC cycle between kernel invocations
A research project in itself

atomic operations

GL interoperability


49

X10 Tutorial Outline

X10 Overview

X10 on GPUs

X10 Tools
Overview
Demo of X10DT and X10 Debugger

Lab Session I

LUNCH

Case Studies

Summary

Lab Session II


50

X10 DT: Integrated Development Tools for the X10 Programmer

Building

Debugging

Editing

Help

A modern development environment for productive parallel programming

Source navigation, syntax highlighting, parsing errors, folding, hyperlinking, outline & quick outline, hover help, content assist, type hierarchy, format, search, call graph, quick fixes

Local and remote launch

- Java/C++ back-end support
- Remote target configuration

X10 programmer guide

X10DT usage guide

Integrated parallel X10 debugging

Launching


51

X10 Editor plus a set of common IDE services
Source navigation

Syntax highlighting

Parsing errors

Folding

Hyperlinking

Outline & quick outline

Hover help

Content assist

More services in progress:
Type hierarchy view

Format

Search

Call graph view

Quick fixes

Editing


52

Editing Services

Outline view

Source folding

Hover help

Syntax highlighting


53

Editing Services (cont.)

Content-Assist (Ctrl-space)


54

Editing Services (cont.)

Early parsing errors


55

Editing Services (cont.)

Hyperlinking (Ctrl + mouse click)

opens a new window


56

X10DT Builders

Support both X10 back-ends: Java and C++

Ability to switch back-ends from the “Configure” context-menu in the Package Explorer view.


57

X10 Builders (2) – C++ back-end

Sockets, MPI, Parallel Environment, IBM LoadLeveler.

Local vs Remote (SSH) Communication Interfaces


58

X10 Builders (3) – C++ back-end

Ability to customize compilation and linking for C++ generated files

Different sanity checks occur: SSH connection, compiler version, path existence, etc.

Global validation available.


59

X10 Launcher

Run As shortcuts for the currently selected back-end

Dedicated X10 launch configuration


60

X10 Launcher (2)

Possibility to tune communication interface options at launch-time

Number of Procs


61

X10 Launcher (3)

X10 run with C++ back-end switches automatically to PTP Parallel Runtime Perspective for better monitoring


62

Debugging (C++ backend)

Extends the Parallel Debugger in the Eclipse Parallel Tools Platform (PTP) with support for X10 source-level debugging
Breakpoints, stepping, variable views, stack views


63

Debugging (C++ backend)

Breakpoints

Variable View

Execution controls (resume, terminate, step into, step over, step return)


64

Help Part 1: X10 Language Help

Various topics of X10 language help

Help -> Help Contents


65

Help: Part 2: X10DT Help

Various help topics for installing and using X10DT

Help -> Help Contents


66

Demo


67

Lab Session I

Objectives
– Install X10/X10DT on your laptop
– Local build/launch of HelloWholeWorld
– Optional remote build/launch of HelloWholeWorld
– Explore code in provided tutorial workspace

After lunch
– X10 case studies
– Summary
– Lab Session II


68

X10 Tutorial Outline

X10 Overview

X10 on GPUs

X10 Tools

Lab Session I

LUNCH

Case Studies

Summary

Lab Session II


69

Case Study Outline

LU Factorization
– Common communication/concurrency patterns in X10
– Teams (collective operations)

Global Load Balancing Framework
– Usage of finish/async/at
– Design/usage of a generic framework

Global Matrix Library
– Building a scalable distributed data structure in X10
– API design for end-user productivity
– Internal communication/concurrency patterns

ANUChem
– Extending X10’s Array library
– Exploitation of X10’s constrained types for performance
– Use of distributed “collecting finish” idiom for scalability

All code available in the X10DT workspace; feel free to follow along…


70

LU Factorization

• SPMD implementation
• one top-level async per place

finish ateach (p in Dist.makeUnique())

• globally synchronous progress
• collectives: barriers, broadcast, reductions (row, column, WORLD)
• point-to-point RDMAs for row swaps
• uses BLAS for local computations

• HPL-like algorithm
• 2-d block-cyclic data distribution
• right-looking variant with row partial pivoting (no look ahead)
• recursive panel factorization with pivot search and column broadcast combined


71

Key Features

• Distributed Data Structures• PlaceLocalHandle

• RemoteArray

• Communications• Array.asyncCopy

• Teams

• Optimizations• Finish Pragmas

• C bindings


72

PlaceLocalHandle

PlaceLocalHandle: global handle to per-place objects

Example:

val dist = Dist.makeUnique();
val constructor = ()=>new Array[Double](N);
val buffers = PlaceLocalHandle.make(dist, constructor);

• allocates a buffer array in each place

• returns a global handle to the collection of buffers

In each place buffers() evaluates to the local buffer
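The idea behind PlaceLocalHandle — one handle, a distinct object per place — can be modeled in a few lines of Python, with places reduced to integer ids (a toy sketch, not the X10 API):

```python
class PlaceLocalHandle:
    """Toy model of X10's PlaceLocalHandle: the constructor runs once per
    place; resolving the handle at a place yields that place's object."""
    def __init__(self, places, constructor):
        self._store = {p: constructor() for p in places}

    def resolve(self, here):
        # in X10, buffers() implicitly resolves at the current place
        return self._store[here]

# one 8-element buffer per "place" 0..3
buffers = PlaceLocalHandle(range(4), lambda: [0.0] * 8)
buffers.resolve(0)[0] = 1.0   # mutating place 0's buffer leaves the others untouched
```

In real X10 the per-place objects never move; only the (small) handle is serialized when it crosses places.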


73

RemoteArray and asyncCopy

RemoteArray: global reference to a fixed local array

Example:

val myBuffer = buffers();
val globalRefToMyBuffer = new RemoteArray(myBuffer);

Array.asyncCopy: asynchronous copy from source array to preallocated destination array (source or dest is local)

• uses RDMAs if available

• avoid local data copies


74

Example

def exchange(row1:Int, row2:Int, min:Int, max:Int, dest:Int) {
    val sourceBuffer = new RemoteArray(buffers());
    val size = matrix().getRow(row1, min, max, buffers()); // pack
    finish {
        at (Place(dest)) async { // go to dest
            finish {
                // copy from source to dest
                Array.asyncCopy[Double](sourceBuffer, 0, buffers(), 0, size);
            } // wait for copy 1
            matrix().swapRow(row2, min, max, buffers()); // swap
            // copy from dest to source
            Array.asyncCopy[Double](buffers(), 0, sourceBuffer, 0, size);
        }
    } // wait for copies 1 and 2
    matrix().setRow(row1, min, max, buffers()); // store
}


75

Teams

X10 Teams are like MPI communicators

• predefined Team.WORLD

• constructor

val team01 = new Team([Place(0), Place(1)]);

• split method

colRole = here.id / py;
rowRole = here.id % py;
col = Team.WORLD.split(here.id, rowRole, colRole);

• operations: barrier, broadcast, indexOfMax, allreduce…

col.indexOfMax(colRole, Math.abs(max), id);
col.addReduce(colRole, max, Team.MAX);


76

C Bindings

@NativeCPPInclude("essl_natives.h") // include .h in code generated from X10
@NativeCPPCompilationUnit("essl_natives.cc") // link with .cc
class LU {
    @NativeCPPExtern native static
    def blockTriSolve(me:Array[Double], diag:Array[Double], B:Int):void;
    @NativeCPPExtern native static
    def blockBackSolve(me:Array[Double], diag:Array[Double], B:Int):void;
    @NativeCPPExtern native static
    def blockMulSub(me:Array[Double], left:Array[Double],
                    upper:Array[Double], B:Int):void;

• mapping by function names and argument positions

• supports primitive types and arrays of primitives


77

Finish Pragmas

Optimize specific finish/async patterns

@Pragma(Pragma.FINISH_HERE) finish {
    at (Place(dest)) async {
        @Pragma(Pragma.FINISH_ASYNC) finish {
            Array.asyncCopy[Double](sourceBuffer, 0, buffers(), 0, size);
        }
        matrix().swapRow(row2, min, max, buffers()); // swap
        Array.asyncCopy[Double](buffers(), 0, sourceBuffer, 0, size);
    }
}

@Pragma(Pragma.FINISH_ASYNC) // unique async
@Pragma(Pragma.FINISH_HERE)  // last async to terminate is local
@Pragma(Pragma.FINISH_LOCAL) // no remote async
@Pragma(Pragma.FINISH_SPMD)  // no nested remote async


78

Global Load Balancing


79

Work-stealing in distributed memory

Autonomous stealing not possible
Need help from NIC or victim

Sparsely distributed work is expensive to find
Network congestion, wasted CPU cycles

Non-trivial to detect termination
Users need to implement their own

Well-studied, but hard to implement

One task stolen at a time
Shallow trees necessitate repeated stealing

Requires aggregation of steals


80

GLB


Main activity spawns an activity per place within a finish.

Augment work-stealing with work-distributing:
Each activity processes. When no work left, steals at most w times.
(Thief) If no work found, visits at most z lifelines in turn.

Donor records that thief needs work.
(Thief) If no work found, dies.
(Donor) Pushes work to recorded thieves, when it has some.

When all activities die, main activity continues to completion.

Caveat: This scheme assumes that work is relocatable
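The thief-side part of the scheme can be sketched as a sequential toy in Python — the real GLB runs distributed, and a thief registers on its lifelines before dying; names here are illustrative:

```python
import random

def find_work(me, places, lifelines, has_work, w, rng=random):
    """Thief-side logic of the scheme above: up to w random steal
    attempts, then visit lifelines in turn; returns the donor found,
    or None (meaning: record on lifelines and die)."""
    for _ in range(w):
        victim = rng.choice([p for p in places if p != me])
        if has_work[victim]:
            return victim
    for l in lifelines:
        if has_work[l]:
            return l
    return None
```

When a dead thief's lifeline donor later produces work, it pushes some to the thief, reviving it — which is what makes global termination detectable by a single enclosing finish.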


81

GLB: state transitions


[State diagram: computing, stealing, distributing, dead]

Can enter X10 to handle asynchronous requests


82

GLB: sample execution


Number of steal attempts = 2
Number of lifelines = 2


83

GLB: lifeline graphs


Structure determines
How soon work is propagated
How much overhead is incurred

Ring network

P/2 hops on average
Low overhead

Desirable properties
Each vertex must have bounded out-degree
Connected
Low diameter


84

Lifeline graph: cyclic hyper-cubes


[Diagram: nodes 00, 01, 02, 10, 11, 12, 20, 21 (base-3 labels), connected as a cyclic hypercube]

P = number of processors
Choose h and z such that P <= h^z

P=8, h=3, z=2

Numbers in base h

Each node has at most Z lifelines
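Computing a node's outgoing lifelines is digit arithmetic in base h; a Python sketch under the P <= h^z assumption above (names illustrative):

```python
def lifelines(p, P, h, z):
    """Outgoing lifelines of place p in a cyclic hypercube over P places,
    z digits in base h: increment each digit mod h, skipping ids >= P."""
    digits = [(p // h**i) % h for i in range(z)]
    out = []
    for i in range(z):
        d = digits[:]
        d[i] = (d[i] + 1) % h                     # step along dimension i, cyclically
        q = sum(dd * h**j for j, dd in enumerate(d))
        if q < P and q != p:                      # drop ids beyond P
            out.append(q)
    return out

# P=8, h=3, z=2 as in the figure
print(lifelines(0, 8, 3, 2))   # → [1, 3]
```

Out-degree is bounded by z, the graph is connected, and the diameter grows only logarithmically in P — exactly the desirable properties listed above.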


85

UTS: Results



86

UTS: Results

1 attempt to steal before resorting to lifeline


87

UTS: Results


22% more time stealing

83 attempts to steal before resorting to lifeline


88

UTS: Results


87% efficiency

ANL’s BG/P


89

UTS: Results


86% efficiency

Power7 on IB cluster


90

UTS: Results


92% efficiency

IBM’s BG/P


91

UTS: Results


89% efficiency

IBM’s Power7 with IB


92

GLB Summary

GLB combines stealing and distributing

Scales with high efficiency to 2048 processors

X10’s UTS was ~350 lines of actual code

Future work
Analytical framework for lifeline graphs

Implement adaptive stealing

Scaling out finish to 100K processors

Hierarchical UTS with multiple threads per-node



93

GLB: framework


static final class Fib2 extends TaskFrame[UInt, UInt] {
    public def runTask(t:UInt, s:Stack[UInt]):void offers UInt {
        if (t < 20u) offer fib(t);
        else {
            s.push(t-1);
            s.push(t-2);
        }
    }
    public def runRootTask(t:UInt, s:Stack[UInt]!):void offers UInt {
        runTask(t, s);
    }
}

TaskFrame[T1, T2] is the user interface for lifeline-based GLB
T1: type of the work-packet
T2: return type from the evaluation function
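The task-splitting pattern in runTask — evaluate small tasks directly, push t-1 and t-2 otherwise — can be exercised sequentially with an explicit stack; a Python sketch (cutoff and names illustrative):

```python
def run_fib(root, cutoff=20):
    """Process Fib tasks from an explicit stack, as the TaskFrame above
    does: tasks below the cutoff are evaluated ('offer'), larger tasks
    are split into t-1 and t-2. Sequential sketch of the GLB work loop."""
    def fib(n):                       # iterative Fibonacci, fib(0) == 0
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a
    total, stack = 0, [root]
    while stack:
        t = stack.pop()
        if t < cutoff:
            total += fib(t)           # 'offer' the result to the reduction
        else:
            stack.append(t - 1)       # split: GLB may steal these entries
            stack.append(t - 2)
    return total
```

In GLB, entries of this stack are what thieves steal and donors distribute; the offered values are combined by the collecting finish.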


94

Global Matrix Library


95

Global Matrix Library (x10.gml)

• Object-oriented library implementing double-precision, blocked, sparse and dense matrices, distributed across multiple places

• Each kind of matrix implemented in a different class, all implement Matrix operations.

• Operations include cell-wise operations (addition, multiplication etc), matrix multiplication (using the SUMMA algorithm), Euclidean distance, etc.

• Presents standard sequential interface (internal parallelism and distribution).

• Written in X10, with native callouts to BLAS

• Runs on Managed Backend (multiple JVMs connected with sockets)

• Runs on Native Backend (on sockets/Ethernet or MPI/sockets, Infiniband).


96

Global Matrix Library: PageRank

X10:

def step(G:…, P:…, GP:…, dupGP:…, UP:…, U:…, E:…) {
    for (1..iteration) {
        GP.mult(G,P)
          .scale(alpha)
          .copyInto(dupGP); // broadcast
        P.local()
          .mult(E, UP.mult(U,P.local()))
          .scale(1-alpha)
          .cellAdd(dupGP.local())
          .sync(); // broadcast
    }
}

DML:

while (i < max_iteration) {
    p = alpha*(G%*%p)+(1-alpha)*(e%*%u%*%p);
}

• Programmer determines distribution of the data.
– G is block distributed per Row

– P is replicated

• Programmer writes sequential, distribution-aware code.

• Perform usual optimizations – allocate big data-structures outside loop

• Could equally well have been in Java.
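One iteration of the DML formula above, written out with dense Python lists (an illustrative sketch, not the GML API):

```python
def pagerank_step(G, p, e, u, alpha):
    """One PageRank iteration, mirroring the DML line above:
    p = alpha*(G p) + (1-alpha)*(e (u p)), where u p is a scalar."""
    n = len(p)
    Gp = [sum(G[i][j] * p[j] for j in range(n)) for i in range(n)]
    up = sum(u[j] * p[j] for j in range(n))       # scalar u·p
    return [alpha * Gp[i] + (1 - alpha) * e[i] * up for i in range(n)]
```

In the GML version, G·p is the only distributed product; the (1-alpha) correction is computed locally on the replicated p and combined after the broadcast.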


97

PageRank performance

PageRank Runtime

[Line chart: runtime (sec, 0–5) vs. nonzeros (million, 10–1000) in G; series: Managed-Socket, Native-Socket, Native-MPI]

40 cores (8 places, 5 workers)


98

Using GML: Gaussian Non-Negative Matrix Factorization

• Key kernel for topic modeling
• Involves factoring a large (D x W) matrix
– D ~ 100M
– W ~ 100K, but sparse (0.001)

• Iterative algorithm, involves distributed sparse matrix multiplication, cell-wise matrix operations.

for (1..iteration) {
    H.cellMult(WV
        .transMult(W,V,tW)
        .cellDiv(WWH
            .mult(WW.transMult(W,W),H)));
    W.cellMult(VH
        .multTrans(V,H)
        .cellDiv(WHH
            .mult(W,HH.multTrans(H,H))));
}

X10

• Key decision is representation for matrix, and its distribution.

– Note: app code is polymorphic in this choice.
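The H update in the loop above is the standard multiplicative GNMF rule H <- H * (WtV) / (WtWH), element-wise; a dense Python sketch (the eps guard against division by zero is an added assumption, not in the X10 code):

```python
def matmul(A, B):
    # dense matrix product over lists of rows
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def update_H(W, H, V, eps=1e-9):
    """Multiplicative GNMF update H <- H * (WtV) / (WtWH), element-wise,
    mirroring the cellMult/cellDiv pipeline above (not the GML API)."""
    WtV = matmul(transpose(W), V)
    WtWH = matmul(matmul(transpose(W), W), H)
    return [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps)
             for j in range(len(H[0]))] for i in range(len(H))]
```

The W update is symmetric (W <- W * (VHt) / (WHHt)); in GML both run with V and W distributed and H replicated, so only cheap factors cross places.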

[Diagram: V, W, H, HH distributed across places P0, P1, P2 … Pn]


99

GNNMF Performance

GNMF runtime

[Line chart: runtime (sec, 0–60) vs. nonzeros in V (million, 100–1000); series: Native-MPI, Native-Socket, Managed-Socket]

40 cores (8 places, 5 workers)


100

Linear Regression

X10:

for (1..iteration) {
    d_p.sync();
    d_q.transMult(V, Vp.mult(V, d_p), d_q2, false);
    q.cellAdd(p1.scale(lambda));
    alpha = norm_r2 / pq.transMult(p, q)(0, 0);
    p.copyTo(p1);
    w.cellAdd(p1.scale(alpha));
    old_norm_r2 = norm_r2;
    r.cellAdd(q.scale(alpha));
    norm_r2 = r.norm(r);
    beta = norm_r2/old_norm_r2;
    p.scale(beta).cellSub(r);
}

DML:

while(i < max_iteration) {
    q = ((t(V) %*% (V %*% p)) + eps * p);
    alpha = norm_r2 / castAsScalar(t(p) %*% q);
    w = w + alpha * p;
    old_norm_r2 = norm_r2;
    r = r + alpha * q;
    norm_r2 = sum(r * r);
    beta = norm_r2 / old_norm_r2;
    p = -r + beta * p;
    i = i + 1;
}
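The loop body is one conjugate-gradient step on the regularized normal equations; the same step, line by line, as a dense Python sketch (illustrative, not the GML API):

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def cg_step(V, p, w, r, norm_r2, eps):
    """One CG step following the DML loop above:
    q = VtVp + eps*p; alpha = r2/(p·q); update w, r; beta; new p."""
    Vt = list(map(list, zip(*V)))
    q = [qi + eps * pi for qi, pi in zip(matvec(Vt, matvec(V, p)), p)]
    alpha = norm_r2 / sum(pi * qi for pi, qi in zip(p, q))
    w = [wi + alpha * pi for wi, pi in zip(w, p)]
    r = [ri + alpha * qi for ri, qi in zip(r, q)]
    new_norm_r2 = sum(ri * ri for ri in r)
    beta = new_norm_r2 / norm_r2
    p = [-ri + beta * pi for ri, pi in zip(r, p)]
    return p, w, r, new_norm_r2
```

Only the V and t(V) products touch the distributed matrix; everything else operates on small replicated vectors, which is why the X10 version syncs p once per iteration.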


101

Linear regression

Linear Regression Benchmark

[Line chart: runtime (sec, 0–100) vs. nonzeros in V (million, 1000–10000); series: Native-MPI, Native-Socket, Managed-Socket]

40 cores (8 places, 5 workers)


102

Getting the GML Code

Entire GML library, example programs, and build/run scripts included in your tutorial workspace

Also available open source under EPL at x10-lang.org
Currently only via svn (x10.gml project in x10 svn)

Packaged release and formal announcement coming soon


103

ANU Chem


104

ANUChem

Computational chemistry codes written in X10
– FMM: electrostatic calculation using Fast Multipole method

• Included in tutorial workspace

– PME: electrostatic calculation using the Smooth Particle Mesh Ewald method

– Pumja Rasaayani: energy calculation using Hartree-Fock SCF-method

Developed by Josh Milthorpe and Ganesh Venkateshwara at ANU
– http://cs.anu.edu.au/~Josh.Milthorpe/anuchem.html

– Open source (EPL); tracks X10 releases

– Around 16,000 lines of X10 code

– Overview and initial results in [Milthorpe et al IPDPS 2011]

• Code under active development; significant performance and scalability improvements in 2.2.1 release vs. results in the paper


105

Goals for Case Study

Deeper understanding of X10 Arrays and DistArrays
– Extending with application-specific behavior
– Using X10’s constrained types to enable static checking and optimization
– DistArray as example of efficient distributed data structure
• Usage of PlaceLocalHandles
• Usage of transient fields and caching

Concurrency/distribution in FMM; use of collecting finish


106

Array and DistArray

A Point is an element of an n-dimensional Cartesian space (n>=1) with integer-valued coordinates, e.g., [5], [1, 2], …

A Region is an ordered set of points, all of the same rank

An Array maps every Point in its defining Region to a corresponding data element.

Key operations

creation

apply/set (read/write elements)

map, reduce, scan

iteration, enhanced for loops

copy

A Dist (distribution) maps every Point in its Region to a Place

A DistArray (distributed array) is defined over a Dist and maps every Point in its distribution to a corresponding data element.

Data elements in the DistArray may only be accessed in their home location (the place to which the Dist maps their Point)

Key operations similar to Array, but bulk operations (map, reduce, etc) are distributed operations

With the exception of Array literals ( [1, 2, 3] === new Array[int](3,(i:int)=>i+1) ), Array and DistArray are implemented completely as library classes in the x10.array package with no special compiler support.


107

DistArrays in Detail

We’ll examine the code of DistArray, Dist, BlockDist
– Local storage accessed by PlaceLocalHandle
– Use of transient fields to control serialization
– Indexing operations

• Key idea: Dist.offset is an abstract method that subclasses must implement that is responsible for mapping Points in the global index space into offsets in local storage

– Bulk operations (eg reduce) implemented internally using distributed collecting finish


108

Extending DistArray: MortonDist

MortonDist (3D) used in FMM to define a DistArray that distributes octree boxes (key data structure) to X10 places

See au.edu.anu.mm.MortonDist.x10

[Figures: ordering of Points in a 2D MortonDist of 0..7*0..7; a 2D MortonDist of 0..7*0..7 over 5 Places]
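The Morton (Z-order) index behind this ordering interleaves the bits of the coordinates; a 2-D Python sketch (FMM's MortonDist is 3-D, and the bit convention here is illustrative):

```python
def morton_index(x, y, bits=3):
    """Z-order (Morton) index of a 2-D point by bit interleaving:
    x supplies the odd bit positions, y the even ones."""
    idx = 0
    for b in range(bits):
        idx |= ((x >> b) & 1) << (2 * b + 1)
        idx |= ((y >> b) & 1) << (2 * b)
    return idx
```

Sorting octree boxes by this index keeps spatially adjacent boxes adjacent in the ordering, so cutting the ordering into contiguous chunks per place yields a locality-preserving distribution.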


109

Extending DistArray: PeriodicDist

Enables modular (“wrap-around”) indexing at each Place

Any Dist can be made periodic by encapsulating it in a PeriodicDist. Can then be used in DistArray to get modular indexing

See x10.array.PeriodicDist
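The modular indexing PeriodicDist adds is just a per-dimension wrap; a Python sketch (illustrative, not the x10.array API):

```python
def periodic_index(point, shape):
    """Wrap a possibly out-of-range point back into the region,
    dimension by dimension, as PeriodicDist's modular indexing does."""
    return tuple(i % n for i, n in zip(point, shape))

# a point one step past either edge wraps around
periodic_index((8, -1), (8, 8))   # → (0, 7)
```

This is useful for stencil-style codes with periodic boundary conditions: neighbor lookups at the edges need no special-casing.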


110

Arrays in Detail

We’ll examine the implementation of x10.array.Array
– Array indexing
• Specialized implementations based on properties.
• Optimized for the common case of Regions of rank<5, zero-based and rectangular (no holes in index space)
• General case fully supported with minimal time overhead on indexing (space usage based on bounding box of Region)
• Constructor does computations to optimize later indexing
• If bounds-checks are disabled, Native X10 indexing performance identical to equivalent C array
– Array iteration
• Compiler optimizes iteration over rectangular Regions into counted for loops.
• Non-rectangular regions fully supported, but uses Point-based Region iterator API (significant performance overhead)


111

FMM: ExpansionRegion

In FMM, a specialized region is used to represent the triangular-shaped region of multipole expansions

Enables error-checking and cleaner code when relevant Arrays defined over ExpansionRegion

Indexing performance good, but in a few key loop nests “hand-inlining” of ExpansionRegion iterator required for maximal performance. Eventually will be handled by X10 compiler improvements (well-known optimizations sufficient, but not yet implemented)

See au.edu.anu.mm.ExpansionRegion and Expansion
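For intuition, a hypothetical illustration of the triangular index space involved: a multipole expansion of order p has terms (l, m) with 0 ≤ l ≤ p and -l ≤ m ≤ l, so a “hand-inlined” loop nest over that space looks like:

```x10
// Hand-inlined equivalent of iterating the triangular (l,m) region,
// replacing the generic Point-based Region iterator.
for (l in 0..p) {
    for (m in -l..l) {
        // ... operate on expansion term (l, m) ...
    }
}
```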

Page 112

FMM: Nested Collecting Finish

The FMM “downward pass” is implemented using an elegant pattern of nested distributed collecting finishes
– Clean, easy-to-read code
– Good scalability (tested to 512 places on BG/P)

Code to examine
– au.edu.anu.mm.Fmm3d.downwardPass
– au.edu.anu.mm.FmmBox.downward
– au.edu.anu.mm.FmmLeafBox.downward
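A hedged sketch of the basic collecting-finish pattern, using the X10 2.2 `finish (r) { ... offer(v); ... }` syntax; `SumReducer` and `localContribution` are hypothetical names, not from the FMM code:

```x10
// Each Reducible supplies an identity and an associative combine operator.
struct SumReducer implements Reducible[Double] {
    public def zero() = 0.0;
    public operator this(a:Double, b:Double) = a + b;
}

// Collecting finish: every offer() across all spawned activities
// (here, one async per place) is combined into a single result.
val total = finish (new SumReducer()) {
    for (place in Place.places()) at (place) async {
        offer(localContribution());   // hypothetical per-place computation
    }
};
```

The nested form used in the FMM downward pass applies this same pattern recursively: each level of the octree collects contributions from the collecting finishes of its children.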

Page 113

X10 Tutorial Outline

X10 Overview

X10 on GPUs

X10 Tools

Lab Session I

LUNCH

Case Studies

Summary

Lab Session II

Page 114

Programmer Resources

x10-lang.org
– Language specification
– Online tutorials
– X10-related classes and research papers
– Performance Tuning Tips
– FAQ

Sample X10 programs
– Small examples in the X10 distribution and tutorials
– X10 Benchmarks: optional download with the X10 release
– X10 Global Matrix Library
– Other applications
  • ANUChem: http://cs.anu.edu.au/~Josh.Milthorpe/anuchem.html
  • Raytracer and OpenGL KMeans programs
  • Your application here…

Page 115

X10 Project Road Map

Upcoming Releases

– X10 2.2.2 (mid-December 2011)
  • Bug fixes and small improvements

– X10 2.3.0 (first half of 2012)
  • Java interoperability
  • Mixed-mode execution (single program running in a mix of Native and Managed places)
  • Will be backwards compatible with X10 2.2.x, but may include language enhancements for determinacy

Page 116

X10 Community

Join the X10 community at x10-lang.org!
– x10-users mailing list
– List of upcoming events, papers, classes, users, etc.

X10/X10DT releases: http://x10-lang.org/software/download-x10

Try X10! We want to hear your experiences and feedback!

We are very interested in publicizing X10 applications and benchmarks that can be shared with the community.

Page 117

Questions and Discussion

Questions?

Discussion?

Next up: Lab Session II…

Page 118

Lab Session II

Objectives
– Work on a more substantial X10 coding project of interest to you
– Some suggested projects
  • Explore the GlobalMatrixLibrary sample programs; write some new code using GML
  • Try writing a new client of GlobalLoadBalancing
  • Try to parallelize (and then distribute) your favorite recursive divide-and-conquer algorithm
  • Explore using X10DT by modifying sample programs
  • Try coding up your favorite problem in X10…
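As a starting point for the divide-and-conquer suggestion, here is the classic X10 Fibonacci sketch, parallelized with async/finish (asynchronous initialization of a val inside a finish is legal in X10 2.2):

```x10
// Parallel recursive divide-and-conquer: the finish waits for the
// child async before f1 + f2 is computed.
def fib(n:Int):Int {
    if (n < 2) return n;
    val f1:Int;
    val f2:Int;
    finish {
        async { f1 = fib(n-1); }  // child activity
        f2 = fib(n-2);            // computed by the current activity
    }
    return f1 + f2;
}
```

Distributing such a computation is then a matter of replacing `async` with `at (p) async` for suitably chosen places, typically with a cutoff below which the recursion runs sequentially.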