TRANSCRIPT
1
Developing Scalable Parallel Applications in X10
David Cunningham, David Grove, Vijay Saraswat, and Olivier Tardieu
IBM Research
Based on material from previous X10 tutorials by Bard Bloom, Evelyn Duesterwald, Steve Fink, Nate Nystrom, Igor Peshansky, Christoph von Praun, and Vivek Sarkar. Additional contributions from Anju Kambadur, Josh Milthorpe, and Juemin Zhang.
This material is based upon work supported in part by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.
Please see x10-lang.org for the most up-to-date version of these slides and sample programs.
2
Tutorial Objectives
Learn to develop scalable parallel applications using X10
– X10 language and class library fundamentals
– Effective patterns of concurrency and distribution
– Case studies: draw material from real working code
– Lab sessions: learn by doing
Exposure to the Asynchronous PGAS programming model
– Concepts broader than any single language
Updates on the X10 project and community
– Highlight what is new in X10 since SC’10
– Gather feedback from you about X10
3
X10 Tutorial Outline
X10 Overview
X10 Fundamentals (~60 minutes)
X10 Workstealing (~20 minutes)
X10 on GPUs (~40 minutes)
X10 Tools (~20 minutes)
Lab Session I
LUNCH
Four Case Studies (approximately 30 minutes each)
LU Factorization
Global Load Balancing Framework
Global Matrix Library
Selections from ANUChem
Summary (~15 minutes)
Lab Session II (~45 minutes)
4
X10: Performance and Productivity at Scale
An evolution of Java for concurrency, scale out, and heterogeneity
The X10 language provides:
– Ability to specify fine-grained concurrency
– Ability to distribute computation across large-scale clusters
– Ability to represent heterogeneity at the language level
– Simple programming model for computation offload
– Modern OO language features (build libraries/frameworks)
– Support for writing deadlock-free and determinate code
– Interoperability with Java
– Interoperability with native libraries (e.g., BLAS)
5
X10 Target Environments
High-end large clusters
– BlueGene/P
– Power7IH
– x86 + IB, Power + IB
– Goal: deliver scalable performance competitive with C+MPI
Medium-scale commodity systems
– ~100 nodes (~1000 cores and ~1 terabyte main memory)
– Scale-out environments, but MTBF is days, not minutes
– Programs that run in minutes/hours at this scale
– Goal: deliver main-memory performance with a simple programming model (accessible to Java programmers)
Developer laptops
– Linux, Mac, Windows. Eclipse-based IDE, debugger, etc.
– Goal: support developer productivity
6
X10 Programming Model
Asynchronous PGAS (Partitioned Global Address Space)
– Global address space, but with explicit locality
– Activities can be dynamically created and joined
Abstracts hardware details, while still allowing programmer to express locality and concurrency.
[Diagram: places as address-space partitions, each running parallel activities; legend shows threads/processes, address spaces, and memory references, with some references crossing place boundaries]
7
X10 APGAS Constructs
Fine grained concurrency
• async S
Atomicity
• atomic S
• when (c) S
Place-shifting operations
• at (P) S
Ordering
• finish S
• clocked, next
Two central ideas: Places and Asynchrony
Language Support for Replication of Values Across Places
Distributed object model
Replicate by default
GlobalRef
Global data structures
DistArray
8
Hello Whole World

import x10.io.Console;

class HelloWholeWorld {
  public static def main(Array[String]) {
    finish for (p in Place.places()) {
      async at (p)
        Console.OUT.println("Hello World from place " + p.id);
    }
  }
}
(%1) x10c++ -o HelloWholeWorld -O HelloWholeWorld.x10
(%2) runx10 -n 4 HelloWholeWorld
Hello World from place 0
Hello World from place 2
Hello World from place 3
Hello World from place 1
(%3)
9
Sequential MontyPi

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val r = new Random();
    var result:Double = 0;
    for (1..N) {
      val x = r.nextDouble();
      val y = r.nextDouble();
      if (x*x + y*y <= 1) result++;
    }
    val pi = 4*result/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}
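The estimator itself is language-independent: sample points in the unit square and count how many land inside the quarter circle. A rough Python sketch of the same sequential loop (function name and seeding are illustrative, not part of X10):

```python
import random

def monty_pi(n, seed=0):
    """Estimate pi by sampling n points in the unit square and
    counting the fraction that satisfy x*x + y*y <= 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x*x + y*y <= 1:
            hits += 1
    return 4 * hits / n

print(monty_pi(100_000))  # roughly 3.14
```

With 100,000 samples the standard error of the estimate is about 0.005, so the result is reliably close to pi.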
10
Concurrent MontyPi

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val P = Int.parse(args(1));
    val result = new Cell[Double](0);
    finish for (1..P) async {
      val r = new Random();
      var myResult:Double = 0;
      for (1..(N/P)) {
        val x = r.nextDouble();
        val y = r.nextDouble();
        if (x*x + y*y <= 1) myResult++;
      }
      atomic result() += myResult;
    }
    val pi = 4*(result())/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}
11
Distributed MontyPi

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val result = GlobalRef[Cell[Double]](new Cell[Double](0));
    finish for (p in Place.places()) at (p) async {
      val r = new Random();
      var myResult:Double = 0;
      for (1..(N/Place.MAX_PLACES)) {
        val x = r.nextDouble();
        val y = r.nextDouble();
        if (x*x + y*y <= 1) myResult++;
      }
      val ans = myResult;
      at (result) atomic result()() += ans;
    }
    val pi = 4*(result()())/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}
12
X10 Project
X10 is an open source project (http://x10-lang.org)
– Released under Eclipse Public License (EPL) v1– Current Release: 2.2.1 (September 2011)
X10 Implementations– C++ based (“Native X10”)
• Multi-process (one place per process; multi-node)• Linux, AIX, MacOS, Cygwin, BlueGene/P• x86, x86_64, PowerPC
– JVM based (“Managed X10”)• Multi-process (one place per JVM process; multi-node)
– Limitation on Windows to single process (single place)• Runs on any Java 5/Java 6 JVM
X10DT (X10 IDE) available for Windows, Linux, Mac OS X
– Based on Eclipse 3.6
– Supports many core development tasks including remote build/execute facilities
13
X10 Compilation Flow
[Diagram: X10 compilation flow. X10 source is parsed and type-checked by the compiler front-end into an X10 AST, which undergoes AST optimizations and lowering. The C++ back-end generates C++ source, compiled by a C++ compiler (with the XRC/XRX runtime) to native code over X10RT ("Native X10"). The Java back-end generates Java source, compiled to bytecode (with the XRJ runtime) for Java VMs ("Managed X10"), bridging to the native environment via JNI.]
14
Overview of Language Features
Many sequential features of Java inherited unchanged
 Classes (w/ single inheritance)
 Interfaces (w/ multiple inheritance)
 Instance and static fields
 Constructors, (static) initializers
 Overloaded, overridable methods
 Garbage collection
 Familiar control structures, etc.
Structs
Closures
Multi-dimensional & distributed arrays
Substantial extensions to the type system
 Dependent types
Generic types
Function types
Type definitions, inference
Concurrency
 Fine-grained concurrency:
async S;
Atomicity and synchronization
atomic S;
when (c) S;
Ordering
finish S;
Distribution
GlobalRef[T]
at (p) S;
15
Classes
 Single inheritance, multiple interfaces
 Static and instance methods
 Static val fields
 Instance fields may be val or var
 Values of class types may be null
 Heap allocated
16
Structs
User-defined primitives
 No inheritance
 May implement interfaces
 All fields are val
 All methods are final
 Allocated "inline" in containing object, array, or variable
 Headerless
struct Complex {
  val real:Double;
  val img:Double;
  def this(r:Double, i:Double) {
    real = r; img = i;
  }
  def operator + (that:Complex) {
    return Complex(real + that.real, img + that.img);
  }
  ....
}

val x = new Array[Complex](1..10)
17
Function Types
(T1, T2, ..., Tn) => U
type of functions that take arguments Ti and returns U
If f: (T) => U and x: T
then invoke with f(x): U
Function types can be used as an interface
 Define the apply operator with the appropriate signature:
 operator this(x:T): U
Closures
First-class functions
(x: T): U => e
used in array initializers:
new Array[int](5, (i:int) => i*i )
yields the array [ 0, 1, 4, 9, 16 ]
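The closure-based array initializer maps directly onto higher-order functions in other languages; a Python sketch of the same pattern (the helper name is made up for illustration):

```python
def make_array(n, init):
    """Build an n-element array by applying the closure `init` to each
    index, mirroring X10's new Array[Int](n, init) initializer form."""
    return [init(i) for i in range(n)]

print(make_array(5, lambda i: i * i))  # [0, 1, 4, 9, 16]
```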
18
Generic Types

Classes, structs, and interfaces may have type parameters
 class ArrayList[T]
 Defines a type constructor ArrayList and a family of types ArrayList[Int], ArrayList[String], ArrayList[C], ...
 ArrayList[C]: as if the ArrayList class were copied and C substituted for T
 Can instantiate on any type, including primitives (e.g., Int)
public final class ArrayList[T] {
  private var data:Array[T];
  private var ptr:int = 0;

  public def this(n:int) {
    data = new Array[T](n);
  }
  ….
  public def addLast(elem:T) {
    if (ptr == data.size) { …grow data… }
    data(ptr++) = elem;
  }

  public def get(idx:int) {
    if (idx < 0 || idx >= ptr) { throw … }
    return data(idx);
  }
}
19
Dependent Types

Classes and structs may have properties (public val instance fields)
 class Region(rank:Int, zeroBased:Boolean, rect:Boolean) { ... }
Can constrain properties with a boolean expression
 Region{rank==3}: type of all regions with rank 3
 Array[Int]{region==R}: type of all arrays defined over region R
 R must be a constant or a final variable in scope at the type
Dependent types are checked statically.
Dependent types used to statically check locality properties
Dependent types used heavily in Array library to check invariants
Dependent type system is extensible
Dependent types enable optimization (compile-time constant propagation of properties)
20
Type inference

Field and local variable types inferred from initializer type
 val x = 1; — x has type int{self==1}
 val y = 1..2*1..10; — y has type Region{rank==2,rect}
Method return types inferred from method body
 def m() { ... return true ... return false ... } — m has return type Boolean
Loop index types inferred from region
 R: Region{rank==2}
 for (p in R) { ... } — p has type Point{rank==2}
Programming Tip: It is often important not to declare the types of local vals; this enables type inference to compute the most precise possible type.
21
async
async S
Creates a new child activity that executes statement S
Returns immediately
S may reference variables in enclosing blocks
Activities cannot be named
Activity cannot be aborted or cancelled
Stmt ::= async Stmt
cf Cilk’s spawn
// Compute the Fibonacci
// sequence in parallel.
def fib(n:Int):Int {
  if (n < 2) return 1;
  val f1:Int;
  val f2:Int;
  finish {
    async f1 = fib(n-1);
    f2 = fib(n-2);
  }
  return f1+f2;
}
22
// Compute the Fibonacci
// sequence in parallel.
def fib(n:Int):Int {
  if (n < 2) return 1;
  val f1:Int;
  val f2:Int;
  finish {
    async f1 = fib(n-1);
    f2 = fib(n-2);
  }
  return f1+f2;
}
finish
finish S
Execute S, but wait until all (transitively) spawned asyncs have terminated.
Rooted exception model
Trap all exceptions thrown by spawned activities.
Throw an (aggregate) exception if any spawned async terminates abruptly.
Implicit finish enclosing the main activity
finish is useful for expressing “synchronous” operations on (local or) remote data.
Stmt ::= finish Stmt
cf Cilk’s sync
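The fib example's structure — spawn one branch with async, compute the other inline, and join at the finish — can be mimicked coarsely with one OS thread per async. A Python sketch of that pattern (not how the X10 runtime schedules activities, which multiplexes asyncs over a worker pool):

```python
import threading

def fib(n):
    """X10-style fib: 'async' becomes a spawned thread for fib(n-1),
    the parent computes fib(n-2), and 'finish' is the join."""
    if n < 2:
        return 1
    result = [0]  # mutable cell standing in for X10's val f1
    t = threading.Thread(target=lambda: result.__setitem__(0, fib(n - 1)))
    t.start()        # async f1 = fib(n-1);
    f2 = fib(n - 2)  # runs in the parent activity
    t.join()         # finish: wait for the spawned activity
    return result[0] + f2

print(fib(10))  # 89 with this base case (fib(0) = fib(1) = 1)
```

One thread per task is wasteful for real workloads; it is used here only to make the finish/async pairing concrete.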
23
// push data onto concurrent
// list-stack
val node = new Node(data);
atomic {
  node.next = head;
  head = node;
}
atomic

atomic S
Execute statement S atomically
Atomic blocks are conceptually executed in a serialized order with respect to all other atomic blocks: isolation and weak atomicity.
An atomic block body (S) ...
 must be nonblocking
 must not create concurrent activities (sequential)
 must not access remote data (local)
Restrictions checked dynamically
// target defined in lexically
// enclosing scope.
atomic def CAS(old:Object, n:Object) {
  if (target.equals(old)) {
    target = n;
    return true;
  }
  return false;
}
Stmt ::= atomic Statement
MethodModifier ::= atomic
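The serialized-in-some-order guarantee of atomic blocks — as used by `atomic result() += myResult` in the concurrent MontyPi — can be modeled with a lock protecting a shared cell. A Python sketch (the Cell class is made up for illustration; X10's atomic is a language construct, not a lock API):

```python
import threading

class Cell:
    """Mutable cell updated under a lock, standing in for X10's
    atomic blocks: updates are serialized with respect to each other."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def add(self, delta):
        with self._lock:  # atomic { ... }
            self._value += delta

    def get(self):
        with self._lock:
            return self._value

result = Cell()
threads = [threading.Thread(target=lambda: [result.add(1) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(result.get())  # 4000: no increments lost
```

Without the lock, concurrent `+=` on a plain attribute could lose updates; the lock plays the role of the runtime's atomicity guarantee.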
24
when
when (E) S
Activity suspends until a state in which the guard E is true.
In that state, S is executed atomically and in isolation.
Guard E is a boolean expression
 must be nonblocking
 must not create concurrent activities (sequential)
 must not access remote data (local)
 must not have side-effects (const)
Stmt ::= WhenStmt
WhenStmt ::= when ( Expr ) Stmt | WhenStmt or (Expr) Stmt
class OneBuffer {
  var datum:Object = null;
  var filled:Boolean = false;
  def send(v:Object) {
    when ( !filled ) {
      datum = v;
      filled = true;
    }
  }
  def receive():Object {
    when ( filled ) {
      val v = datum;
      datum = null;
      filled = false;
      return v;
    }
  }
}
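OneBuffer's when-guards map naturally onto condition variables: suspend until the guard holds, then update atomically. A Python sketch of the same one-slot buffer (an analogy, not generated code):

```python
import threading

class OneBuffer:
    """One-slot buffer: send blocks while filled, receive blocks while
    empty, mirroring the when(!filled) / when(filled) guards above."""
    def __init__(self):
        self.datum = None
        self.filled = False
        self.cond = threading.Condition()

    def send(self, v):
        with self.cond:
            while self.filled:        # when (!filled)
                self.cond.wait()
            self.datum, self.filled = v, True
            self.cond.notify_all()

    def receive(self):
        with self.cond:
            while not self.filled:    # when (filled)
                self.cond.wait()
            v, self.datum, self.filled = self.datum, None, False
            self.cond.notify_all()
            return v

buf = OneBuffer()
out = []
t = threading.Thread(target=lambda: out.append(buf.receive()))
t.start()
buf.send("hello")
t.join()
print(out)  # ['hello']
```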
25
at
at (p) S
Execute statement S at place p
Current activity is blocked until S completes
Deep copy of local object graph to target place; the variables referenced from S define the roots of the graph to be copied.
GlobalRef[T] used to create remote pointers across at boundaries
Stmt ::= at (p) Stmt
class C {
  var x:int;
  def this(n:int) { x = n; }
}

// Increment remote counter
def inc(c:GlobalRef[C]) {
  at (c.home) c().x++;
}

// Create GR of C
static def make(init:int) {
  val c = new C(init);
  return GlobalRef[C](c);
}
26
Distributed Object Model
Objects live in a single place
Objects can only be accessed in the place where they live
Object graph reachable from the statements in an “at” will be deeply copied from the source to destination place and a new isomorphic graph created
Instance fields of an object may be declared transient to break deep copy
GlobalRef[T] enables cross-place pointers to be created
GlobalRef can be freely copied, but referent can only be accessed via the apply at its home place
public struct GlobalRef[T](home:Place){T<:Object} {
public native def this(t:T):GlobalRef{self.home==here};
public native operator this(){here == this.home}:T;
}
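The home-place constraint on GlobalRef's apply operator — dereference only where `here == this.home` — can be sketched as a reference that records its home and refuses to dereference elsewhere. A Python analogy (class and method names invented; X10 enforces this statically via dependent types, not a runtime check):

```python
class GlobalRef:
    """Reference that records its home 'place'; dereferencing at any
    other place raises, mimicking the here == this.home constraint."""
    def __init__(self, obj, home):
        self.obj = obj    # the referent stays at its home place
        self.home = home

    def deref(self, here):
        if here != self.home:
            raise RuntimeError("GlobalRef dereferenced away from home")
        return self.obj

ref = GlobalRef([1, 2, 3], home=0)
print(ref.deref(here=0))  # [1, 2, 3]
```

The ref itself can be copied anywhere (like the struct above), but only code running at the home place may follow it.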
27
Summary
Whirlwind tour through X10 languageAdditional details will be introduced via case studies
See http://x10-lang.org for specification, samples, etc.
Next topic: alternate implementations of async/finish in the X10 runtime and their implications for programmers
Questions so far?
28
X10 Runtime
One process per place
• choice of inter-process communication libraries (x10rt)
• standalone (shared mem), sockets (tcp/ip), MPI, PAMI
Pool of worker threads per place
• runtime scheduler maps asyncs to threads (many to 1)
• choice of two schedulers• fork-join scheduler (default, ~ Java Fork-Join)• work-stealing scheduler (work in progress, ~ Cilk)
• to use compile with “x10c++ -WORK-STEALING”
29
X10 Schedulers
• each worker maintains a queue of pending jobs• async p; q…
• FJ: worker pushes p, runs q… (job = async)
• WS: worker pushes q…, runs p (job = continuation)
• idle worker
 • pops a job from its own queue if any
 • steals a job from a random other worker if empty (spins)
• queues are double-ended
 • push and pop from one end, steal from the other end
 => no contention for statically balanced workloads
 • less contention if queues are non-empty
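The double-ended discipline can be sketched with a plain deque: the owner pushes and pops at one end, a thief steals from the other, so they only meet when the queue is nearly empty. A single-threaded Python illustration of the access pattern (real work-stealing deques are concurrent and use fences/CAS as noted below):

```python
from collections import deque

class WorkQueue:
    """Owner pushes/pops at the right end; thieves steal from the left,
    so owner and thief only collide when the queue is nearly empty."""
    def __init__(self):
        self.jobs = deque()

    def push(self, job):   # owner: an async was spawned
        self.jobs.append(job)

    def pop(self):         # owner: take the most recent job
        return self.jobs.pop() if self.jobs else None

    def steal(self):       # thief: take the oldest job
        return self.jobs.popleft() if self.jobs else None

q = WorkQueue()
for job in ["a", "b", "c"]:
    q.push(job)
print(q.pop())    # 'c' (owner sees LIFO order)
print(q.steal())  # 'a' (thief takes from the opposite end)
```

Owner LIFO keeps recently created (cache-warm) work local; thieves taking the oldest job tend to get the largest remaining subtrees.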
30
Blocking Constructs
• finish typically does not block the worker
 • WS: run all subtasks first
 • FJ: run a subtask from the queue instead of blocking
• finish may block because of thieves or remote asyncs
 • worker 1 reaches the end of a finish block while worker 2 is still running an async
• context switch (blocking finish, when)
 • FJ: OS-level context switch
  • worker thread suspends, another worker thread is awakened
 • WS: runtime-level context switch
  • runtime saves the worker's state, assigns a new job to the worker
31
Cost Analysis

• async
 • queue operations (fences, at worst CAS)
• finish (single place)
 • synchronized counter (FJ: always, WS: if >1 worker)
• FJ relies on helper objects for finish and async
 • overhead of object creation + heap allocation + GC
• WS relies on compiler transformation (CPS)
 • sequential overhead (ex: 1.3x for QuickSort)
 • steal overhead (~ context switch)
32
Pros and Cons
• WS
 + much more scalable with fine-grain concurrency
 + fixed-size thread pool
 - sequential overhead
• FJ
 - requires coarser tasks
 - makes load balancing harder
 + more “obvious” performance model
 + more mature implementation (for now)
33
X10 Tutorial Outline
X10 Overview
X10 on GPUs
CUDA via APGAS
K-Means example & demo
X10 Tools
Lab Session I
LUNCH
Case Studies
Summary
Lab Session II
34
Why Program GPUs with X10?
Why program the GPU at all?
Many times faster for certain classes of applications
Hardware is cheap and widely available (as opposed to e.g. FPGA)
Why write X10 (instead of CUDA/OpenCL)?
Use the same programming model (APGAS) on GPU and CPU
Easier to write parallel/distributed GPU-aware programs
Better type safety
Higher level abstractions => fewer lines of code
35
GPU programming will never be as easy as CPU programming.
We try to make it as easy as possible.
Correctness
Can debug X10 kernel code on CPU first, using standard techniques
Static errors avoid certain classes of faults.
Goal: Eliminate all GPU segfaults with negligible overhead.
Currently detect all dereferences of non-local data (place errors)
TODO: Static array bounds checking
TODO: Static null-pointer checking
Performance: Need understanding of the CUDA performance model
Must know advantages+limitations of [registers, SHM, global memory]
Avoid warp divergence
Avoid irregular/misaligned memory access
Use CUDA profiling tool to debug kernel performance (very easy and usable)
Can inspect and disassemble generated cubin file
Easier to tune blocks/threads using auto-configuration
X10/GPU Programming Experience
36
Why target CUDA and not OpenCL?
In early 2009, OpenCL support was limited
CUDA is based on C++
OpenCL based on C
C++ features help us implement X10 features (e.g. generics)
By targeting CUDA we can re-use parts of existing C++ backend
CUDA evolving fast, more interesting for high level languages (e.g. OOP)
However there are advantages of OpenCL too...
We could support it in future
37
APGAS model ↔ GPU model
 at ↔ GPU / host divide
 finish ↔ kernel invocation
 async ↔ blocks, threads
 clocks ↔ __syncthreads()
 Java-like sequential constructs ↔ shared mem, constant mem, local mem, DMAs
38
GPU / Host divide

[Diagram: hosts connected by a high-performance network; each host drives one or more CUDA GPUs over the PCI bus. Each GPU is a 'place' in the APGAS model.]

Goals
 Re-use existing language concepts wherever possible
 Independent GPU memory space ⇒ new place
 Place count and topology unknown until run-time
 Same X10 code works on different run-time configurations
 Could use same approach for OpenCL, Cell, FPGA, …
39
X10/CUDA code (mass sqrt example)

for (host in Place.places()) at (host) {
  val init = new Array[Float](1000, (i:Int)=>i as Float);
  val recv = new Array[Float](1000);
  for (accel in here.children().values()) {  // discover GPUs
    // alloc on GPU
    val remote = CUDAUtilities.makeRemoteArray[Float](accel, 1000, (Int)=>0.0f);
    val num_blocks = 8, num_threads = 64;
    finish async at (accel) @CUDA {          // GPU code from here down
      finish for ([block] in 0..(num_blocks-1)) async {
        clocked finish for ([thread] in 0..(num_threads-1)) clocked async {
          val tid = block*num_threads + thread;
          val tids = num_blocks*num_threads;
          for (var i:Int=tid ; i<1000 ; i+=tids) {
            // 'init' is implicitly captured and transferred to the GPU
            remote(i) = Math.sqrtf(init(i));
          }
        }
      }
    }
    // Console.OUT.println(remote(42));      // static type error
    finish Array.asyncCopy(remote, 0, recv, 0, recv.size);  // copy result to host
    Console.OUT.println(recv(42));
  }
}
40
__global__ void kernel(...) {
  ...
}

kernel<<<num_blocks, num_threads>>>(...)
CUDA threads as APGAS activities
41
CUDA threads as APGAS activities
finish async at (p) @CUDA {
finish for ([block] in 0..(blocks-1)) async {
clocked finish for ([thread] in 0..(threads-1)) clocked async {
val tid = block*threads + thread; val tids = blocks*threads;
for (var i:Int=tid ; i<len ; i+=tids) { remote(i) = Math.sqrtf(init(i)); }
}
}
}
Enforces language restrictions
Define kernel shape
Only sequential code
Only primitive types / arrays
No allocation, no dynamic dispatch
[Diagram labels: CUDA kernel, CUDA block, CUDA thread]
42
Dynamic Shared Memory & barrier
finish async at (p) @CUDA {
finish for (block in 0..(blocks-1)) async {
val shm = new Array[Int](threads, (Int)=>0);
clocked finish for (thread in 0..(threads-1)) clocked async {
shm(thread) = f(thread); Clock.advanceAll(); val tmp = shm((thread+1) % threads);
} } }
Compiled to CUDA'shared' memory
Initialised on GPU(once per block)
General philosophy:
Use existing X10 constructs to express CUDA semantics
APGAS model is sufficient
CUDA barrier represented with clock op
43
Constant cache memory
val arr = new Array[Int](threads, (Int)=>0);
finish async at (p) @CUDA {
val cmem : Sequence[Int] = arr.sequence();
finish for ([block] in 0..(blocks-1)) async {
clocked finish for ([thread] in 0..(threads-1)) clocked async {
val tmp = cmem(42);
} } }
Compiled to CUDA'constant' cache
Automatically uploaded to GPU (just before kernel invocation)
('Sequence' is an immutable array)
General philosophy:
Quite similar to 'shared' memory
Again, APGAS model is sufficient
44
Blocks/Threads auto-config
finish async at (p) @CUDA {
val blocks = CUDAUtilities.autoBlocks(), threads = CUDAUtilities.autoThreads();
finish for ([block] in 0..(blocks-1)) async {
clocked finish for ([thread] in 0..(threads-1)) clocked async {
// kernel cannot assume any particular // values for 'blocks' and 'threads'
}
}
}
Maximise 'occupancy' at run-time according to a simple heuristic
If called on the CPU, autoBlocks() == 1, autoThreads() == 8
45
KMeansCUDA kernel

finish async at (gpu) @CUDA {
  val blocks = CUDAUtilities.autoBlocks(),
      threads = CUDAUtilities.autoThreads();
  finish for ([block] in 0..(blocks-1)) async {
    // cache clusters in SHM
    val clustercache = new Array[Float](num_clusters*dims, clusters_copy);
    clocked finish for ([thread] in 0..(threads-1)) clocked async {
      val tid = block * threads + thread;
      val tids = blocks * threads;
      // strip-mine the outer loop
      for (var p:Int=tid ; p<num_local_points ; p+=tids) {
        var closest:Int = -1;
        var closest_dist:Float = Float.MAX_VALUE;
        @Unroll(20) for ([k] in 0..(num_clusters-1)) {  // unroll this loop
          var dist:Float = 0;
          for ([d] in 0..(dims-1)) {
            // SOA form to allow regular accesses;
            // padded for alignment (stride includes padding)
            val tmp = gpu_points(p+d*num_local_points_stride)
                      - clustercache(k*dims+d);
            dist += tmp * tmp;
          }
          if (dist < closest_dist) {
            closest_dist = dist;
            closest = k;
          }
        }
        gpu_nearest(p) = closest;  // allocated earlier (via CUDAUtilities)
      }
    }
  }
}

Inputs are dragged from the host each iteration.
The above kernel is only part of the story: it is used once per iteration of Lloyd's algorithm, gpu_nearest is copied to the CPU when done, and the rest of the algorithm runs faster on the CPU!
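Stripped of the SOA layout, padding, and unrolling, the kernel's inner logic is plain nearest-centroid assignment: for each point, find the closest of K cluster centres by squared Euclidean distance. A sequential Python sketch of just that step (names are illustrative):

```python
def assign_nearest(points, clusters):
    """For each point, return the index of the closest cluster centre
    by squared Euclidean distance (the comparison done in the kernel)."""
    nearest = []
    for p in points:
        best, best_dist = -1, float("inf")
        for k, c in enumerate(clusters):
            dist = sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            if dist < best_dist:
                best, best_dist = k, dist
        nearest.append(best)
    return nearest

points = [(0.0, 0.0), (5.0, 5.0), (0.9, 1.1)]
clusters = [(0.0, 1.0), (5.0, 4.0)]
print(assign_nearest(points, clusters))  # [0, 1, 0]
```

Each point's assignment is independent, which is exactly why the loop strip-mines so cleanly across CUDA threads.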
46
K-Means graphical demo
Map of USA (can scroll / zoom)
300 Million points (2 dimensional space)
Data auto-generated from zip-code population data
 Town pop. normally distributed around town centers
Use K-Means to find 50 centroids for USA population
 Possible use case: delivery distribution warehouses
Visualization uses GL
 Runs on the GPU
Completely independent from CUDA
GPUs can be used for graphics too! :)
Can compare K-Means performance on CPU and GPU
[Chart: K-Means end-to-end performance. X-axis: 2M/4M points with K=100/400; Y-axis: 0 to 3; series: 1x1, 1x2, 2x1, 2x2 hosts x GPUs (per host).]
K-Means End-to-End Performance
Colours show scaling up to 4 Tesla GPUs
Y-axis gives Gflops normalized WRT the native CUDA implementation (single GPU, host)
2M/4M points, K=100/400, 4 dimensions
Higher K ⇒ more GPU work ⇒ better scaling
Analysis

Performance sensitive to many factors:
 CPU code: fragile g++ optimisations
 X10/CUDA DMA bandwidth (GPUs share PCI bus)
 Infiniband for host<->host communication (also shares PCI bus)
 nvcc register allocation fragile
Outstanding DMA issue (~40% bandwidth loss)
 CUDA requires X10 CPU objects allocated with cudaMalloc
 Hard to integrate with existing libraries / GC
 Currently staging DMAs through a pre-allocated buffer
 ...complaints by users on CUDA forums
Options for further improving performance:
 Use multicore for CPU parts
 Hide DMAs
Hosts x GPUs (per host)
48
Future Work

Support texture cache
 Explicit allocation: val t = Texture.make(...)
 at (gpu) { … t(x, y) … }
Fermi architecture presents obvious opportunities for supporting X10
 Indirect branches ⇒ object orientation and exceptions
 NVidia's goal is the same as ours – allow high-level GPU programming
OpenCL
 No conceptual hurdles
 Different code-gen and runtime work
 X10 programming experience should be very similar
Memory allocation on GPU
 Latest CUDA has a 'malloc' call we might use
GPU GC cycle between kernel invocations
 A research project in itself
Atomic operations
GL interoperability
49
X10 Tutorial Outline
X10 Overview
X10 on GPUs
X10 Tools
Overview
Demo of X10DT and X10 Debugger
Lab Session I
LUNCH
Case Studies
Summary
Lab Session II
50
X10DT: Integrated Development Tools for the X10 Programmer

A modern development environment for productive parallel programming

Editing: source navigation, syntax highlighting, parsing errors, folding, hyperlinking, outline & quick outline, hover help, content assist, type hierarchy, format, search, call graph, quick fixes
Building: local and remote launch
 - Java/C++ back-end support
 - Remote target configuration
Launching and Debugging: integrated parallel X10 debugging
Help: X10 programmer guide, X10DT usage guide
51
X10 Editor plus a set of common IDE services
 Source navigation
Syntax highlighting
Parsing errors
Folding
Hyperlinking
Outline & quick outline
Hover help
Content assist
More services in progress:
 Type hierarchy view
Format
Search
Call graph view
Quick fixes
Editing
52
Editing Services
Outline view
Source folding
Hover help
Syntax highlighting
53
Editing Services (cont.)
Content-Assist (Ctrl-space)
54
Editing Services (cont.)
Early parsing errors
55
Editing Services (cont.)
Hyperlinking (Ctrl + mouse click) opens up a new window
56
X10DT Builders
Support both X10 back-ends: Java and C++
Ability to switch back-ends from the “Configure” context menu in the Package Explorer view.
57
X10 Builders (2) – C++ back-end
Sockets, MPI, Parallel Environment, IBM LoadLeveler.
Local vs Remote (SSH) Communication Interfaces
58
X10 Builders (3) – C++ back-end
Ability to customize compilation and linking for C++ generated files
Different sanity checks occur: SSH connection, compiler version, paths existence, etc…
Global validation available.
59
X10 Launcher
Run As shortcuts for the current back-end selected
Dedicated X10 launch configuration
60
X10 Launcher (2)
Possibility to tune communication interface options at launch-time
Number of Procs
61
X10 Launcher (3)
An X10 run with the C++ back-end switches automatically to the PTP Parallel Runtime Perspective for better monitoring
62
Debugging (C++ backend)
Extends the Parallel Debugger in the Eclipse Parallel Tools Platform (PTP) with support for X10 source-level debugging
 Breakpoints, stepping, variable views, stack views
63
Debugging (C++ backend)
Breakpoints
Variable View Execution controls
(resume, terminate, step into, step over, step return)
64
Help Part 1: X10 Language Help
Various topics of X10 language help
Help -> Help Contents
65
Help: Part 2: X10DT Help
Various help topics for installing and using X10DT
Help -> Help Contents
66
Demo
67
Lab Session I
Objectives
– Install X10/X10DT on your laptop
– Local build/launch of HelloWholeWorld
– Optional remote build/launch of HelloWholeWorld
– Explore code in the provided tutorial workspace
After lunch
– X10 case studies
– Summary
– Lab Session II
68
X10 Tutorial Outline
X10 Overview
X10 on GPUs
X10 Tools
Lab Session I
LUNCH
Case Studies
Summary
Lab Session II
69
Case Study Outline
LU Factorization
 Common communication/concurrency patterns in X10
 Teams (collective operations)
Global Load Balancing Framework
 Usage of finish/async/at
 Design/usage of a generic framework
Global Matrix Library
 Building a scalable distributed data structure in X10
 API design for end-user productivity
 Internal communication/concurrency patterns
ANUChem
 Extending X10's Array library
 Exploitation of X10's constrained types for performance
 Use of the distributed "collecting finish" idiom for scalability
All code available in the X10DT workspace; feel free to follow along…
70
LU Factorization
• SPMD implementation
 • one top-level async per place
   finish ateach (p in Dist.makeUnique())
 • globally synchronous progress
 • collectives: barriers, broadcasts, reductions (row, column, WORLD)
 • point-to-point RDMAs for row swaps
 • uses BLAS for local computations
• HPL-like algorithm
 • 2-d block-cyclic data distribution
 • right-looking variant with row partial pivoting (no look-ahead)
 • recursive panel factorization with pivot search and column broadcast combined
71
Key Features
• Distributed Data Structures• PlaceLocalHandle
• RemoteArray
• Communications• Array.asyncCopy
• Teams
• Optimizations• Finish Pragmas
• C bindings
72
PlaceLocalHandle
PlaceLocalHandle: global handle to per-place objects

Example:

val dist = Dist.makeUnique();
val constructor = ()=>new Array[Double](N);
val buffers = PlaceLocalHandle.make(dist, constructor);

• allocates a buffer array in each place
• returns a global handle to the collection of buffers

In each place, buffers() evaluates to the local buffer
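Conceptually, the handle maps each place to an independently constructed object. A toy Python sketch of that shape (a dict keyed by place id; the helper name is invented, and this ignores the distribution that makes the real thing useful):

```python
def make_place_local(places, constructor):
    """One object per 'place' behind a single handle, sketching
    PlaceLocalHandle.make(dist, constructor): the constructor runs
    once per place, so the per-place objects are independent."""
    return {p: constructor() for p in places}

buffers = make_place_local(range(4), lambda: [0.0] * 8)
print(len(buffers), len(buffers[2]))  # 4 8
```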
73
RemoteArray and asyncCopy
RemoteArray: global reference to a fixed local array

Example:

val myBuffer = buffers();
val globalRefToMyBuffer = new RemoteArray(myBuffer);

Array.asyncCopy: asynchronous copy from a source array to a preallocated destination array (source or dest is local)

• uses RDMAs if available
• avoids local data copies
74
Example
def exchange(row1:Int, row2:Int, min:Int, max:Int, dest:Int) {
  val sourceBuffer = new RemoteArray(buffers());
  val size = matrix().getRow(row1, min, max, buffers()); // pack
  finish {
    at (Place(dest)) async { // go to dest
      finish {
        // copy from source to dest
        Array.asyncCopy[Double](sourceBuffer, 0, buffers(), 0, size);
      } // wait for copy 1
      matrix().swapRow(row2, min, max, buffers()); // swap
      // copy from dest to source
      Array.asyncCopy[Double](buffers(), 0, sourceBuffer, 0, size)
    }
  } // wait for copies 1 and 2
  matrix().setRow(row1, min, max, buffers()); // store
}
75
Teams
X10 Teams are like MPI communicators

• predefined Team.WORLD
• constructor
  val team01 = new Team([Place(0), Place(1)]);
• split method
  colRole = here.id / py;
  rowRole = here.id % py;
  col = Team.WORLD.split(here.id, rowRole, colRole);
• operations: barrier, broadcast, indexOfMax, allreduce, …
  col.indexOfMax(colRole, Math.abs(max), id);
  col.addReduce(colRole, max, Team.MAX);
76
C Bindings
@NativeCPPInclude("essl_natives.h") // include .h in code generated from X10
@NativeCPPCompilationUnit("essl_natives.cc") // link with .cc
class LU {
  @NativeCPPExtern native static
  def blockTriSolve(me:Array[Double], diag:Array[Double], B:Int):void;
  @NativeCPPExtern native static
  def blockBackSolve(me:Array[Double], diag:Array[Double], B:Int):void;
  @NativeCPPExtern native static
  def blockMulSub(me:Array[Double], left:Array[Double],
                  upper:Array[Double], B:Int):void;
}

• mapping by function names and argument positions
• supports primitive types and arrays of primitives
77
Finish Pragmas
Optimize specific finish/async patterns
@Pragma(Pragma.FINISH_HERE) finish {
  at (Place(dest)) async {
    @Pragma(Pragma.FINISH_ASYNC) finish {
      Array.asyncCopy[Double](sourceBuffer, 0, buffers(), 0, size);
    }
    matrix().swapRow(row2, min, max, buffers()); // swap
    Array.asyncCopy[Double](buffers(), 0, sourceBuffer, 0, size)
  }
}

@Pragma(Pragma.FINISH_ASYNC) // unique async
@Pragma(Pragma.FINISH_HERE) // last async to terminate is local
@Pragma(Pragma.FINISH_LOCAL) // no remote async
@Pragma(Pragma.FINISH_SPMD) // no nested remote async
78
Global Load Balancing
79
Work-stealing in distributed memory

Autonomous stealing not possible
 Need help from the NIC or the victim
Sparsely distributed work is expensive to find
 Network congestion, wasted CPU cycles
Non-trivial to detect termination
 Users need to implement their own
Well-studied, but hard to implement
One task stolen at a time
 Shallow trees necessitate repeated stealing
 Requires aggregation of steals
80
GLB
Main activity spawns an activity per place within a finish.
Augment work-stealing with work-distributing:
 Each activity processes work; when none is left, it steals at most w times.
 (Thief) If no work is found, it visits at most z lifelines in turn; each donor records that the thief needs work.
 (Thief) If still no work is found, it dies.
 (Donor) Pushes work to recorded thieves when it has some.
When all activities die, the main activity continues to completion.
Caveat: this scheme assumes that work is relocatable.
81
GLB: state transitions
[State diagram: computing, stealing, distributing, dead]
Can enter X10 to handle asynchronous requests
82
GLB: sample execution
Number of steal attempts = 2
Number of lifelines = 2
83
GLB: lifeline graphs
Structure determines
  How soon work is propagated
  How much overhead is incurred
Ring network
  P/2 hops on average
  Low overhead
Desirable properties
  Each vertex must have bounded out-degree
  Connected
  Low diameter
84
Lifeline graph: cyclic hyper-cubes
[Figure: lifeline graph with nodes labeled in base 3: 00, 01, 02, 10, 11, 12, 20, 21]
P = number of processors
Choose h and z such that P <= h^z
P=8, h=3, z=2
Numbers in base h
Each node has at most z lifelines
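The cyclic hypercube construction above can be sketched directly: write each place id in base h with z digits, and form each lifeline by incrementing one digit modulo h, skipping ids >= P when P < h^z. This Python sketch is illustrative (function name invented here), using the slide's P=8, h=3, z=2.

```python
def lifelines(node, P, h, z):
    """Outgoing lifelines of `node` in a cyclic hypercube over P places.

    The node id is written in base h with z digits; each lifeline
    increments one digit modulo h.  Ids >= P (holes when P < h**z)
    are skipped, so every node ends up with at most z lifelines.
    """
    digits = [(node // h ** k) % h for k in range(z)]
    out = []
    for k in range(z):
        d = digits[:]
        d[k] = (d[k] + 1) % h
        peer = sum(dk * h ** k2 for k2, dk in enumerate(d))
        if peer < P and peer != node:
            out.append(peer)
    return out
```

For P=8, h=3, z=2 (as on the slide): node 0 ("00" in base 3) links to 1 ("01") and 3 ("10"); node 7 ("21") has only one lifeline because incrementing its low digit would give "22" = 8, which is outside the 8 places.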
85
UTS: Results
86
UTS: Results
1 attempt to steal before resorting to lifeline
87
UTS: Results
22% more time stealing
83 attempts to steal before resorting to lifeline
88
UTS: Results
87% efficiency
ANL’s BG/P
89
UTS: Results
86% efficiency
Power7 on IB cluster
90
UTS: Results
92% efficiency
IBM’s BG/P
91
UTS: Results
89% efficiency
IBM’s Power7 with IB
92
GLB Summary
GLB combines stealing and distributing
Scales with high efficiency to 2048 processors
X10’s UTS was ~350 lines of actual code
Future work
  Analytical framework for lifeline graphs
  Implement adaptive stealing
  Scaling out finish to 100K processors
  Hierarchical UTS with multiple threads per node
93
GLB: framework
static final class Fib2 extends TaskFrame[UInt, UInt] {
  public def runTask(t:UInt, s:Stack[UInt]):void offers UInt {
    if (t < 20u) offer fib(t);
    else { s.push(t-1); s.push(t-2); }
  }
  public def runRootTask(t:UInt, s:Stack[UInt]!):void offers UInt {
    runTask(t, s);
  }
}
TaskFrame[T1, T2] is the user interface for lifeline-based GLB
  T1: type of the work packet
  T2: return type of the evaluation function
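The TaskFrame pattern above can be mimicked sequentially: a worker pops work packets from a stack, solves small ones directly, and splits large ones into sub-packets, with `offer` accumulating partial results. A Python sketch (the cutoff of 20 and the fib decomposition come from the Fib2 example; the sequential driver here stands in for the distributed GLB runtime):

```python
from functools import lru_cache

CUTOFF = 20

@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def run_task(t, stack, offers):
    """Analogue of TaskFrame.runTask: solve small packets, split big ones."""
    if t < CUTOFF:
        offers.append(fib(t))        # 'offer' the partial result
    else:
        stack.append(t - 1)
        stack.append(t - 2)

def run_root_task(root):
    """Analogue of runRootTask plus the worker loop and final reduction."""
    stack, offers = [root], []
    while stack:
        run_task(stack.pop(), stack, offers)
    return sum(offers)               # reduction over all offered values
```

Because fib(n) = fib(n-1) + fib(n-2), summing the offered leaf values reconstructs fib(root) no matter how the packets are scheduled, which is why the real version tolerates arbitrary stealing.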
94
Global Matrix Library
95
Global Matrix Library (x10.gml)
• Object-oriented library implementing double-precision, blocked, sparse and dense matrices, distributed across multiple places
• Each kind of matrix implemented in a different class, all implement Matrix operations.
• Operations include cell-wise operations (addition, multiplication, etc.), matrix multiplication (using the SUMMA algorithm), Euclidean distance, etc.
• Presents standard sequential interface (internal parallelism and distribution).
• Written in X10, with native callouts to BLAS
• Runs on Managed Backend (multiple JVMs connected with sockets)
• Runs on Native Backend (on sockets/Ethernet, or MPI over Ethernet or InfiniBand).
96
Global Matrix Library: PageRank
X10:
def step(G:…, P:…, GP:…, dupGP:…, UP:…, U:…, E:…) {
  for (1..iteration) {
    GP.mult(G,P)
      .scale(alpha)
      .copyInto(dupGP); // broadcast
    P.local()
      .mult(E, UP.mult(U,P.local()))
      .scale(1-alpha)
      .cellAdd(dupGP.local())
      .sync(); // broadcast
  }
}

DML:
while (i < max_iteration) {
  p = alpha*(G%*%p) + (1-alpha)*(e%*%u%*%p);
}
• Programmer determines distribution of the data.
  – G is block-distributed by row
  – P is replicated
• Programmer writes sequential, distribution-aware code.
• Perform usual optimizations – allocate big data structures outside the loop.
• Could equally well have been written in Java.
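For reference, the iteration the X10 and DML kernels both express, p = alpha*(G p) + (1-alpha)*e*(uᵀp), can be sketched in plain Python. This is illustrative only (small dense matrices in pure Python; GML runs the same recurrence distributed with BLAS callouts):

```python
def pagerank(G, alpha=0.85, iters=50):
    """Power iteration p = alpha*G*p + (1-alpha)*e*(u^T p).

    G is a column-stochastic link matrix given as a list of rows;
    e is the all-ones column and u the uniform vector, so the second
    term redistributes (1-alpha) of the probability mass uniformly.
    """
    n = len(G)
    p = [1.0 / n] * n
    for _ in range(iters):
        Gp = [sum(G[i][j] * p[j] for j in range(n)) for i in range(n)]
        up = sum(p) / n                        # u^T p with uniform u
        p = [alpha * Gp[i] + (1 - alpha) * up for i in range(n)]
    return p
```

Because G is column-stochastic and both terms preserve total mass, p remains a probability distribution across iterations.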
97
PageRank performance
[Chart: PageRank runtime (seconds, 0–5) vs. nonzeros in G (10–1000 million), for Managed-Socket, Native-Socket, and Native-MPI; 40 cores (8 places, 5 workers)]
98
Using GML: Gaussian Non-Negative Matrix Factorization
• Key kernel for topic modeling
• Involves factoring a large (D x W) matrix
  – D ~ 100M
  – W ~ 100K, but sparse (0.001)
• Iterative algorithm, involves distributed sparse matrix multiplication, cell-wise matrix operations.

X10:
for (1..iteration) {
  H.cellMult(WV
    .transMult(W,V,tW)
    .cellDiv(WWH
      .mult(WW.transMult(W,W),H)));
  W.cellMult(VH
    .multTrans(V,H)
    .cellDiv(WHH
      .mult(W,HH.multTrans(H,H))));
}
• Key decision is representation for matrix, and its distribution.
– Note: app code is polymorphic in this choice.
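The X10 kernel above is a multiplicative NMF update. A hedged pure-Python sketch of one such step (tiny dense matrices and invented helper names; the real code is distributed and sparse):

```python
def matmul(A, B):
    """Dense matrix product of nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def cellwise(A, B, op):
    """Element-wise combination of two same-shaped matrices."""
    return [[op(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def gnmf_step(V, W, H, eps=1e-9):
    """One multiplicative NMF update, mirroring the X10 kernel:
    H *= (W^T V) / (W^T W H);  W *= (V H^T) / (W H H^T)."""
    WtV = matmul(transpose(W), V)
    WtWH = matmul(matmul(transpose(W), W), H)
    H = cellwise(H, cellwise(WtV, WtWH, lambda a, b: a / (b + eps)),
                 lambda h, r: h * r)
    VHt = matmul(V, transpose(H))
    WHHt = matmul(W, matmul(H, transpose(H)))
    W = cellwise(W, cellwise(VHt, WHHt, lambda a, b: a / (b + eps)),
                 lambda w, r: w * r)
    return W, H
```

The multiplicative form keeps W and H non-negative and monotonically reduces the reconstruction error ||V - WH||², which is why no step size is needed.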
[Figure: distribution of V, W, H, and HH across places P0, P1, P2, …, Pn]
99
GNNMF Performance
[Chart: GNMF runtime (seconds, 0–60) vs. nonzeros in V (100–1000 million), for Native-MPI, Native-Socket, and Managed-Socket; 40 cores (8 places, 5 workers)]
100
Linear Regression
X10:
for (1..iteration) {
  d_p.sync();
  d_q.transMult(V, Vp.mult(V, d_p), d_q2, false);
  q.cellAdd(p1.scale(lambda));
  alpha = norm_r2 / pq.transMult(p, q)(0, 0);
  p.copyTo(p1);
  w.cellAdd(p1.scale(alpha));
  old_norm_r2 = norm_r2;
  r.cellAdd(q.scale(alpha));
  norm_r2 = r.norm(r);
  beta = norm_r2 / old_norm_r2;
  p.scale(beta).cellSub(r);
}

DML:
while (i < max_iteration) {
  q = ((t(V) %*% (V %*% p)) + eps * p);
  alpha = norm_r2 / castAsScalar(t(p) %*% q);
  w = w + alpha * p;
  old_norm_r2 = norm_r2;
  r = r + alpha * q;
  norm_r2 = sum(r * r);
  beta = norm_r2 / old_norm_r2;
  p = -r + beta * p;
  i = i + 1;
}
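The loop shown is conjugate gradient on the regularized normal equations, solving (VᵀV + eps·I)w = Vᵀy. A hedged pure-Python sketch (function name and the tiny dense matvec are illustrative; the real kernels are distributed GML operations):

```python
def cg_linreg(V, y, eps=0.01, iters=20):
    """Conjugate gradient mirroring the DML loop: solves
    (V^T V + eps*I) w = V^T y for ridge-regression weights w."""
    m, n = len(V), len(V[0])
    def matvec(p):                          # q = V^T (V p) + eps * p
        Vp = [sum(V[i][j] * p[j] for j in range(n)) for i in range(m)]
        return [sum(V[i][j] * Vp[i] for i in range(m)) + eps * p[j]
                for j in range(n)]
    b = [sum(V[i][j] * y[i] for i in range(m)) for j in range(n)]
    w = [0.0] * n
    r = [-bj for bj in b]                   # residual r = A w - b, w = 0
    p = [bj for bj in b]                    # search direction p = -r
    norm_r2 = sum(rj * rj for rj in r)
    for _ in range(iters):
        if norm_r2 < 1e-12:                 # converged: avoid 0/0 below
            break
        q = matvec(p)
        alpha = norm_r2 / sum(pj * qj for pj, qj in zip(p, q))
        w = [wj + alpha * pj for wj, pj in zip(w, p)]
        r = [rj + alpha * qj for rj, qj in zip(r, q)]
        old, norm_r2 = norm_r2, sum(rj * rj for rj in r)
        beta = norm_r2 / old
        p = [-rj + beta * pj for rj, pj in zip(r, p)]
    return w
```

Each iteration needs exactly one multiply by VᵀV + eps·I (the `d_q.transMult` / `q = t(V)%*%(V%*%p) + eps*p` step), which dominates the distributed cost.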
101
Linear regression
[Chart: Linear Regression runtime (seconds, 0–100) vs. nonzeros in V (1000–10000 million), for Native-MPI, Native-Socket, and Managed-Socket; 40 cores (8 places, 5 workers)]
102
Getting the GML Code
Entire GML library, example programs, and build/run scripts included in your tutorial workspace
Also available open source under EPL at x10-lang.org
  Currently only via svn (x10.gml project in x10 svn)
Packaged release and formal announcement coming soon
103
ANU Chem
104
ANUChem
Computational chemistry codes written in X10
– FMM: electrostatic calculation using the Fast Multipole Method
• Included in tutorial workspace
– PME: electrostatic calculation using the Smooth Particle Mesh Ewald method
– Pumja Rasaayani: energy calculation using Hartree-Fock SCF-method
Developed by Josh Milthorpe and Ganesh Venkateshwara at ANU
– http://cs.anu.edu.au/~Josh.Milthorpe/anuchem.html
– Open source (EPL); tracks X10 releases
– Around 16,000 lines of X10 code
– Overview and initial results in [Milthorpe et al IPDPS 2011]
• Code under active development; significant performance and scalability improvements in 2.2.1 release vs. results in the paper
105
Goals for Case Study
Deeper understanding of X10 Arrays and DistArrays
– Extending with application-specific behavior
– Using X10’s constrained types to enable static checking and optimization
– DistArray as an example of an efficient distributed data structure
  • Usage of PlaceLocalHandles
  • Usage of transient fields and caching
Concurrency/distribution in FMM; use of collecting finish
106
Array and DistArray

A Point is an element of an n-dimensional Cartesian space (n >= 1) with integer-valued coordinates, e.g., [5], [1, 2], …
A Region is an ordered set of points, all of the same rank
An Array maps every Point in its defining Region to a corresponding data element.
Key operations
creation
apply/set (read/write elements)
map, reduce, scan
iteration, enhanced for loops
copy
A Dist (distribution) maps every Point in its Region to a Place
A DistArray (distributed array) is defined over a Dist and maps every Point in its distribution to a corresponding data element.
Data elements in the DistArray may only be accessed in their home location (the place to which the Dist maps their Point)
Key operations similar to Array, but bulk operations (map, reduce, etc) are distributed operations
With the exception of Array literals ( [1, 2, 3] === new Array[int](3, (i:int)=>i+1) ), Array and DistArray are implemented completely as library classes in the x10.array package with no special compiler support.
107
DistArrays in Detail
We’ll examine the code of DistArray, Dist, BlockDist
– Local storage accessed via PlaceLocalHandle
– Use of transient fields to control serialization
– Indexing operations
  • Key idea: Dist.offset is an abstract method that subclasses must implement; it is responsible for mapping Points in the global index space to offsets in local storage
– Bulk operations (e.g. reduce) implemented internally using distributed collecting finish
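The Dist.offset idea, mapping a global index to a (place, local offset) pair, can be sketched for a 1-D block distribution. Names here are illustrative inventions, not the x10.array API:

```python
class BlockDist:
    """Sketch of a 1-D block distribution of indices 0..n-1 over P places.

    Analogue of the Dist.offset contract: place() gives the home place of
    a global index, offset() its position in that place's local storage.
    """
    def __init__(self, n, P):
        self.n, self.P = n, P
        self.block = (n + P - 1) // P       # ceil(n / P) elements per place
    def place(self, i):
        return i // self.block              # home place of global index i
    def offset(self, i):
        return i % self.block               # offset into local storage
```

Each global index maps to exactly one (place, offset) pair, so local storage per place can be a plain contiguous array.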
108
Extending DistArray: MortonDist
MortonDist (3D) used in FMM to define a DistArray that distributes octree boxes (key data structure) to X10 places
See au.edu.anu.mm.MortonDist.x10
[Left: ordering of Points in a 2D MortonDist of 0..7*0..7.  Right: a 2D MortonDist of 0..7*0..7 over 5 Places]
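Morton (Z-order) indexing interleaves the bits of the coordinates so that nearby points tend to get nearby indices, which is what makes it attractive for distributing octree boxes. A 2D illustration in Python (the 3D MortonDist in FMM interleaves three coordinates; function name and bit convention here are illustrative):

```python
def morton2d(x, y, bits=3):
    """Interleave the bits of (x, y) into a Z-order (Morton) index.

    x occupies the even bit positions, y the odd ones, so consecutive
    indices trace the characteristic 'Z' pattern through the grid.
    """
    m = 0
    for b in range(bits):
        m |= ((x >> b) & 1) << (2 * b)
        m |= ((y >> b) & 1) << (2 * b + 1)
    return m
```

Over an 8x8 grid this is a bijection onto 0..63, so sorting boxes by Morton index and cutting the sequence into segments yields the block-per-place distribution suggested by the figure.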
109
Extending DistArray: PeriodicDist
Enables modular (“wrap-around”) indexing at each Place
Any Dist can be made periodic by encapsulating it in a PeriodicDist. Can then be used in DistArray to get modular indexing
See x10.array.PeriodicDist
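The wrap-around behavior can be shown with a tiny analogue in Python (a hypothetical helper, not the x10.array API):

```python
class PeriodicArray:
    """Sketch of PeriodicDist-style modular indexing on a plain list:
    every integer index, including negatives, wraps into 0..n-1."""
    def __init__(self, data):
        self.data = data
    def __getitem__(self, i):
        # Python's % already yields a non-negative result for negatives
        return self.data[i % len(self.data)]
```

This is the indexing convention stencil codes with periodic boundary conditions need: index -1 refers to the last element and index n to the first.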
110
Arrays in Detail
We’ll examine the implementation of x10.array.Array
– Array indexing
  • Specialized implementations based on properties.
  • Optimized for the common case of Regions of rank < 5 that are zero-based and rectangular (no holes in the index space).
  • General case fully supported with minimal time overhead on indexing (space usage based on the bounding box of the Region).
  • Constructor does computations to optimize later indexing.
  • If bounds checks are disabled, Native X10 indexing performance is identical to an equivalent C array.
– Array iteration
  • Compiler optimizes iteration over rectangular Regions into counted for loops.
  • Non-rectangular Regions fully supported, but use the Point-based Region iterator API (significant performance overhead).
111
FMM: ExpansionRegion
In FMM, a specialized region is used to represent the triangular-shaped region of multipole expansions
Enables error-checking and cleaner code when relevant Arrays defined over ExpansionRegion
Indexing performance is good, but in a few key loop nests “hand-inlining” of the ExpansionRegion iterator was required for maximal performance. This will eventually be handled by X10 compiler improvements (well-known optimizations suffice, but are not yet implemented).
See au.edu.anu.mm.ExpansionRegion and Expansion
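For intuition, a triangular region of multipole coefficients indexed by (l, m) with -l <= m <= l can be packed densely into a flat array. This formula is illustrative of the kind of mapping such a region needs (it is an assumption about the layout; see ExpansionRegion for the actual code):

```python
def expansion_offset(l, m):
    """Dense offset for a multipole coefficient (l, m), -l <= m <= l.

    Row l of the triangle has 2l+1 entries, and rows 0..l-1 contribute
    sum(2k+1) = l*l entries before it, so the offset is l*l + l + m.
    """
    assert -l <= m <= l
    return l * l + l + m
```

A region with up to p terms therefore packs into exactly (p+1)² slots with no holes, which is what makes dense storage and fast indexing possible for a non-rectangular region.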
112
FMM: Nested Collecting Finish
The FMM “downward pass” is implemented using an elegant pattern of nested distributed collecting finishes– Clean, easy to read code– Good scalability (tested to 512 places on BG/P)
Code to examine– au.edu.anu.mm.Fmm3d.downwardPass– au.edu.anu.mm.FmmBox.downward– au.edu.anu.mm.FmmLeafBox.downward
113
X10 Tutorial Outline
X10 Overview
X10 on GPUs
X10 Tools
Lab Session I
LUNCH
Case Studies
Summary
Lab Session II
114
Programmer Resources
x10-lang.org
– Language specification
– Online tutorials
– X10-related classes and research papers
– Performance Tuning Tips
– FAQ
• Sample X10 programs
– Small examples in X10 distribution and tutorials
– X10 Benchmarks: optional download with X10 release
– X10 Global Matrix Library
– Other Applications
  • ANUChem: http://cs.anu.edu.au/~Josh.Milthorpe/anuchem.html
  • Raytracer and OpenGL KMeans programs
  • Your application here…
115
X10 Project Road Map
Upcoming Releases
– X10 2.2.2, mid-December 2011
  • Bug fixes and small improvements
– X10 2.3.0, first half of 2012
  • Java interoperability
  • Mixed-mode execution (single program running in a mix of Native and Managed places)
  • Will be backwards compatible with X10 2.2.x, but may include language enhancements for determinacy
116
X10 Community
Join the X10 community at x10-lang.org!
– x10-users mailing list
– List of upcoming events, papers, classes, users, etc.
X10/X10DT releases: http://x10-lang.org/software/download-x10
Try X10! We want to hear your experiences and feedback!
Very interested in publicizing X10 applications and benchmarks that can be shared with the community.
117
Questions and Discussion
Questions?
Discussion?
Next up: Lab Session II…
118
Lab Session II
Objectives
– Work on a more substantial X10 coding project of interest to you
– Some suggested projects
  • Explore Global Matrix Library sample programs; write some new code using GML
  • Try writing a new client of GlobalLoadBalancing
  • Try to parallelize (and then distribute) your favorite recursive divide-and-conquer algorithm
  • Explore using X10DT by modifying sample programs
  • Try coding up your favorite problem in X10…