
© 2012 IBM Corporation

Developing Scalable Parallel Applications in X10

David Grove, Vijay Saraswat, Olivier Tardieu
IBM TJ Watson Research Center

Based on material from previous X10 tutorials by Bard Bloom, David Cunningham, Evelyn Duesterwald, Steve Fink, Nate Nystrom, Igor Peshansky, Christoph von Praun, and Vivek Sarkar

Additional contributions from Anju Kambadur, Josh Milthorpe, and Juemin Zhang

This material is based upon work supported in part by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.

Please see http://x10-lang.org/tutorials/sc2012 for the most up-to-date version of this tutorial.

© 2012 IBM Corporation

Tutorial Objectives

Learn to develop scalable parallel applications using X10
– X10 language and class library fundamentals
– Effective patterns of concurrency and distribution
– Case studies: draw material from real working code

Exposure to the Asynchronous Partitioned Global Address Space (APGAS) programming model
– Concepts broader than any single language
– Move beyond Single Program/Multiple Data (SPMD) to more flexible paradigms

Updates on the X10 project and user/research community
– Highlight what is new in X10 since SC'11

Gather feedback from you
– About your application domains
– About X10

© 2012 IBM Corporation

Tutorial Outline

X10 Overview (70 minutes)
– Programming model concepts
– Key language features
– Implementation highlights

Scaling SPMD-style computations in X10 (40 minutes)
– Scaling finish/async
– Scaling communication
– Collective operations

Unbalanced computations (20 minutes)
– Global Load Balancing & UTS

Application frameworks (20 minutes)
– Global Matrix Library
– SAT X10

Summary and Discussion (20 minutes)

© 2012 IBM Corporation

X10 Genesis: DARPA HPCS Program (2004)

Central Challenge: Productive Programming of Large Scale Supercomputers
– Clustered systems
  • 1000's of SMP nodes connected by a high-performance interconnect
  • Large aggregate memory, disk storage, etc.

[Figure: SMP clusters; each SMP node holds PEs and memory, and nodes are connected by an interconnect. Massively parallel processor systems, e.g. IBM Blue Gene.]

© 2012 IBM Corporation

Flash forward a few years… Big Data and Commercial HPC

Central Challenge: Productive programming of large commodity clusters
– Clustered systems
  • 100's to 1000's of SMP nodes connected by a high-performance network
  • Large aggregate memory, disk storage, etc.

[Figure: SMP clusters; each SMP node holds PEs and memory, and nodes are connected by a network.]

Commodity cluster != MPP system, but the programming model problem is highly similar

© 2012 IBM Corporation

X10 Performance and Productivity at Scale

An evolution of Java for concurrency, scale out, and heterogeneity
– Language focuses on high productivity and high performance
– Bring productivity gains from the commercial world to HPC developers

The X10 language provides:
– Java-like language (statically typed, object oriented, garbage-collected)
– Ability to specify scale-out computations (multiple places, exploit modern networks)
– Ability to specify fine-grained concurrency (exploit multi-core)
– Single programming model for computation offload and heterogeneity (exploit GPUs)
– Migration path
  • X10 concurrency/distribution idioms can be realized in other languages via library APIs that wrap the X10 runtime
  • X10 interoperability with Java and C/C++ enables reuse of existing libraries

© 2012 IBM Corporation

Partitioned Global Address Space (PGAS) Languages

In clustered systems, memory is only accessible to the CPUs on its node

Managing local vs. remote memory is a key programming task

PGAS combines a single logical global address space with locality awareness
– PGAS languages: Titanium, UPC, CAF, X10, Chapel

© 2012 IBM Corporation

X10 combines PGAS with asynchrony (APGAS)

Fine-grained concurrency
• async S

Atomicity
• atomic S
• when (c) S

Place-shifting operations
• at (P) S
• at (P) { e }

Sequencing
• finish S

[Figure: Place 0 through Place N, each with a local heap and its own activities; global references may point to objects at other places.]

© 2012 IBM Corporation

Hello Whole World

class HelloWholeWorld {
  public static def main(args:Rail[String]) {
    finish
      for (p in Place.places())
        at (p)
          async
            Console.OUT.println(p + " says " + args(0));
  }
}


% x10c++ HelloWholeWorld.x10

% X10_NPLACES=4 ./a.out hello

Place 0 says hello

Place 2 says hello

Place 3 says hello

Place 1 says hello

© 2012 IBM Corporation

Sequential Monty Pi

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val r = new Random();
    var result:Double = 0;
    for (1..N) {
      val x = r.nextDouble();
      val y = r.nextDouble();
      if (x*x + y*y <= 1) result++;
    }
    val pi = 4*result/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}

© 2012 IBM Corporation

Concurrent Monty Pi

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val P = Int.parse(args(1));
    val result = new Cell[Double](0);
    finish for (1..P) async {
      val r = new Random();
      var myResult:Double = 0;
      for (1..(N/P)) {
        val x = r.nextDouble();
        val y = r.nextDouble();
        if (x*x + y*y <= 1) myResult++;
      }
      atomic result() += myResult;
    }
    val pi = 4*(result())/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}

© 2012 IBM Corporation

Concurrent Monty Pi (Collecting Finish)

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val P = Int.parse(args(1));
    val result = finish (Reducible.SumReducer[Double]()) {
      for (1..P) async {
        val r = new Random();
        var myResult:Double = 0;
        for (1..(N/P)) {
          val x = r.nextDouble();
          val y = r.nextDouble();
          if (x*x + y*y <= 1) myResult++;
        }
        offer myResult;
      }
    };
    val pi = 4*result/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}

© 2012 IBM Corporation

Distributed Monty Pi (Collecting Finish)

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val result = finish (Reducible.SumReducer[Double]()) {
      for (p in Place.places()) at (p) async {
        val r = new Random();
        var myResult:Double = 0;
        for (1..(N/Place.MAX_PLACES)) {
          val x = r.nextDouble();
          val y = r.nextDouble();
          if (x*x + y*y <= 1) myResult++;
        }
        offer myResult;
      }
    };
    val pi = 4*result/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}

© 2012 IBM Corporation

Distributed Monty Pi (GlobalRef)

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val result = GlobalRef[Cell[Double]](new Cell[Double](0));
    finish for (p in Place.places()) at (p) async {
      val r = new Random();
      var myResult:Double = 0;
      for (1..(N/Place.MAX_PLACES)) {
        val x = r.nextDouble();
        val y = r.nextDouble();
        if (x*x + y*y <= 1) myResult++;
      }
      val valResult = myResult;
      at (result) atomic result()() += valResult;
    }
    val pi = 4*(result()())/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}

© 2012 IBM Corporation

Overview of Language Features

Many sequential features of Java inherited unchanged
– Classes (single inheritance)
– Interfaces (multiple inheritance)
– Instance and static fields
– Constructors
– Overloaded, overridable methods
– Automatic memory management (GC)
– Familiar control structures, etc.

Structs
– Headerless, inlined, user-defined primitives

Closures
– Encapsulate a function as a code block + dynamic environment

Type system extensions
– Dependent types (see the sketch after this list)
– Generic types (instantiation-based)
– Type inference (local)

Concurrency
– Fine-grained concurrency: async S;
– Atomicity and synchronization: atomic S; when (c) S;
– Ordering: finish S;

Distribution
– Place shifting: at (p) S;
– Distributed object model
  • Automatic serialization of object graphs
  • GlobalRef[T] to create cross-place references
  • PlaceLocalHandle[T] to define objects with distributed state
– Class library support for RDMA, collectives, etc.
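As a quick illustration of the dependent (constrained) types mentioned above (a sketch, not from the original slides): the type Array[String](1) used throughout the examples is shorthand for the constrained type Array[String]{rank==1}, a one-dimensional array. The method below is a made-up example that requires a rank-2 array.

class TraceExample {
  // trace of the leading n x n block; the rank==2 constraint lets us index with two subscripts
  static def trace(m:Array[Double]{rank==2}, n:Int):Double {
    var s:Double = 0;
    for (i in 0..(n-1)) s += m(i,i);
    return s;
  }
}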

© 2012 IBM Corporation

async

async S

• Creates a new child activity that executes statement S

• Returns immediately

• S may reference variables in enclosing blocks

• Activities cannot be named

• Activity cannot be aborted or cancelled

Stmt ::= async Stmt

cf Cilk’s spawn

// Compute the Fibonacci
// sequence in parallel.
def fib(n:Int):Int {
  if (n < 2) return 1;
  val f1:Int;
  val f2:Int;
  finish {
    async f1 = fib(n-1);
    f2 = fib(n-2);
  }
  return f1+f2;
}

© 2012 IBM Corporation

// Compute the Fibonacci
// sequence in parallel.
def fib(n:Int):Int {
  if (n < 2) return 1;
  val f1:Int;
  val f2:Int;
  finish {
    async f1 = fib(n-1);
    f2 = fib(n-2);
  }
  return f1+f2;
}

finish

finish S

Execute S, but wait until all (transitively) spawned asyncs have terminated.

finish is useful for expressing “synchronous” operations on (local or) remote data.

Rooted exception model

– Trap all exceptions thrown by spawned activities.

– Throw an (aggregate) exception if any spawned async terminates abruptly.

implicit finish around program main

Stmt ::= finish Stmt

cf Cilk’s sync
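A minimal sketch (not from the slides) of the rooted exception model described above: an exception thrown by a spawned async is trapped by the enclosing finish and rethrown there, wrapped in an aggregate exception; the plain Exception handler below is assumed to catch that aggregate.

class FinishExceptions {
  public static def main(args:Rail[String]) {
    try {
      finish {
        async { throw new Exception("child failed"); }
        async { Console.OUT.println("other child still runs"); }
      }
    } catch (e:Exception) {
      // finish waited for both children, then resurfaced the failure here
      Console.OUT.println("finish surfaced: " + e);
    }
  }
}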

© 2012 IBM Corporation

// push data onto concurrent
// list-stack
val node = new Node(data);
atomic {
  node.next = head;
  head = node;
}

atomic

atomic S: execute statement S atomically.

Atomic blocks are conceptually executed in a serialized order with respect to all other atomic blocks: isolation and weak atomicity.

An atomic block body (S) ...
– must be nonblocking
– must not create concurrent activities (sequential)
– must not access remote data (local)
– restrictions checked dynamically

// target defined in lexically
// enclosing scope.
atomic def CAS(old:Object, n:Object) {
  if (target.equals(old)) {
    target = n;
    return true;
  }
  return false;
}

Stmt ::= atomic Stmt
MethodModifier ::= atomic

© 2012 IBM Corporation

when

when (E) S

Activity suspends until a state in which the guard E is true.

In that state, S is executed atomically and in isolation.

Guard E is a boolean expression
– must be nonblocking
– must not create concurrent activities (sequential)
– must not access remote data (local)
– must not have side effects (const)

Restrictions are mostly checked dynamically; the const restriction is not checked.

Stmt ::= WhenStmt
WhenStmt ::= when ( Expr ) Stmt | WhenStmt or ( Expr ) Stmt

class OneBuffer {
  var datum:Object = null;
  var filled:Boolean = false;
  def send(v:Object) {
    when (!filled) {
      datum = v;
      filled = true;
    }
  }
  def receive():Object {
    when (filled) {
      val v = datum;
      datum = null;
      filled = false;
      return v;
    }
  }
}

© 2012 IBM Corporation

at

at (p) S

Execute statement S at place p

Current activity is blocked until S completes

Deep copy of local object graph to target place; the variables referenced from S define the roots of the graph to be copied.

GlobalRef[T] used to create remote pointers across at boundaries

Stmt ::= at (p) Stmt

class C {
  var x:int;
  def this(n:int) { x = n; }
}

// Increment remote counter
def inc(c:GlobalRef[C]) {
  at (c.home) c().x++;
}

// Create GR of C
static def make(init:int) {
  val c = new C(init);
  return GlobalRef[C](c);
}

© 2012 IBM Corporation

Distributed Object Model

Objects live in a single place

Objects can only be accessed in the place where they live

Implementing at
– Compiler analyzes the body of "at" and identifies roots to copy (exposed variables)
– Complete object graph reachable from the roots is serialized and sent to the destination place
– A new (unrelated) copy of the object graph is created at the destination place

Controlling object graph serialization
– Instance fields of a class may be declared transient (won't be copied)
– GlobalRef[T] allows cross-place pointers to be created
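A hedged sketch of the copy-versus-reference distinction just described (Counter is a made-up class; the behavior of transient fields and GlobalRef is as stated in the bullets above):

class Counter {
  var count:Int = 0;
  transient var scratch:Rail[Int] = new Rail[Int](1024); // transient: not copied by at
}

class CopyVsRef {
  public static def main(args:Rail[String]) {
    val c = new Counter();
    val gc = GlobalRef[Counter](c);
    finish for (p in Place.places()) at (p) async {
      c.count++;                           // increments a private copy created at place p
      at (gc.home) atomic gc().count++;    // increments the one original object at its home place
    }
    Console.OUT.println("original count = " + c.count); // equals the number of places
  }
}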

© 2012 IBM Corporation

Take a breath…

Whirlwind tour through the X10 language
– More details will be introduced as we work through the example codes
– See http://x10-lang.org for specification, samples, etc.
– Next topic: brief overview of the X10 implementation

Questions so far?

© 2012 IBM Corporation

X10 Target Environments

High-end large HPC clusters
– BlueGene/P (since 2010); BlueGene/Q (in progress)
– Power7IH (aka PERCS machine)
– x86 + InfiniBand, Power + InfiniBand
– Goal: deliver scalable performance competitive with C+MPI

Medium-scale commodity systems
– ~100 nodes (~1000 cores and ~1 terabyte main memory)
– Goal: deliver main-memory performance with a simple programming model (accessible to Java programmers)

Developer laptops
– Linux, Mac OS X, Windows. Eclipse-based IDE, debugger, etc.
– Goal: support developer productivity

© 2012 IBM Corporation

X10 Implementation Summary

X10 Implementations
– C++ based ("Native X10")
  • Multi-process (one place per process; multi-node)
  • Linux, AIX, MacOS, Cygwin, BlueGene
  • x86, x86_64, PowerPC
– JVM based ("Managed X10")
  • Multi-process (one place per JVM process; multi-node)
  • Limited on Windows to a single process (single place)
  • Runs on any Java 6 JVM

X10DT (X10 IDE) available for Windows, Linux, Mac OS X
– Based on Eclipse 3.7
– Supports many core development tasks, including remote build/execute facilities

IBM Parallel Debugger for X10 Programming
– Adds X10 language support to the IBM Parallel Debugger
– Available on IBM developerWorks (Native X10 on Linux only)

© 2012 IBM Corporation

X10 Compilation

[Figure: X10 compilation flow. X10 source is parsed and type-checked by the X10 compiler front end into an X10 AST, which undergoes AST optimizations and lowering. The C++ back end generates C++ source that a C++ compiler turns into native code (Native X10, linked with the XRC and XRX runtime libraries); the Java back end generates Java source that a Java compiler turns into bytecode for JVMs (Managed X10, with XRJ). Both targets sit on X10RT, with JNI bridging Managed X10 to the native environment.]

© 2012 IBM Corporation

X10 Runtime Software Stack

Core Class Libraries
– Fundamental classes & primitives, arrays, core I/O, collections, etc.
– Written in X10; compiled to C++ or Java

XRX (X10 Runtime in X10)
– APGAS functionality
  • Concurrency: async/finish (work stealing)
  • Distribution: Places/at
– Written in X10; compiled to C++ or Java

X10 Language Native Runtime
– Runtime support for core sequential X10 language features
– Two versions: C++ and Java

X10RT
– Active messages, collectives, bulk data transfer
– Implemented in C
– Abstracts/unifies network layers (PAMI, DCMF, MPI, etc.) to enable X10 on a range of transports

[Figure: X10 runtime software stack. An X10 application program sits on the X10 core class libraries, the XRX runtime, and the X10 language native runtime, all above X10RT, which runs over PAMI, DCMF, MPI, or TCP/IP.]

© 2012 IBM Corporation

Performance Model: async, finish, and when

Pool of worker threads in each Place
– Each worker maintains a queue of pending asyncs
– A worker pops an async from its own queue, executes it, repeats...
– If a worker's queue is empty, it steals from another worker's queue

Finish typically does not block the worker (it runs a subtask from the queue instead of blocking)

Finish may block because of thieves or a remote async
– e.g. worker 1 reaches the end of a finish block while worker 2 is still running an async

A blocking finish (or a when) incurs an OS-level context switch
– May incur thread creation cost (to maintain the set of active workers)

Cost analysis
– async
  • queue operations (fences, at worst CAS)
  • heap-allocated task object (allocation + GC cost)
– finish (single place)
  • synchronized counter
  • heap-allocated finish object (allocation + GC cost)
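To make the async cost items above concrete, here is a small sketch (names are made up, not from the slides) of the standard way to amortize the per-async overhead: spawn one async per chunk of iterations rather than one per element. Runtime.NTHREADS is assumed here to expose the worker-thread count.

class ChunkedLoop {
  static def process(i:Int) { /* per-element work elided */ }

  public static def main(args:Rail[String]) {
    val N = 1000000;
    val CHUNKS = Runtime.NTHREADS;        // roughly one async per worker thread
    finish for (c in 0..(CHUNKS-1)) async {
      val lo = c*N/CHUNKS;
      val hi = (c+1)*N/CHUNKS - 1;
      for (i in lo..hi) process(i);       // sequential work inside each async
    }
  }
}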

© 2012 IBM Corporation

Performance Model: Places and at

At runtime, each Place is a separate process

Costs of at
– Serialization/deserialization (a function of the size of the captured state)
– Communication (if destination != here)
– Scheduling (worker thread may block) (if destination != here)
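A short illustrative sketch of the serialization cost item above: since only the variables referenced by the body of at are captured, extracting a small value before the at avoids shipping a large object graph.

class AtCapture {
  public static def main(args:Rail[String]) {
    val big = new Rail[Double](1000000);   // stand-in for a large local object graph
    val small = big(0);                    // extract only what the remote side needs
    for (p in Place.places())
      at (p) Console.OUT.println(p + " got " + small); // serializes 'small', not 'big'
  }
}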

© 2012 IBM Corporation

Performance Model: Native X10 vs. Managed X10

Native X10: base performance model is C++
– Object-oriented features are more expensive
  • C++ compilers don't optimize virtual dispatch, interface invocation, instanceof, etc.
– Memory management: BDWGC
  • Allocation approximately like malloc
  • Conservative, non-moving GC
  • Parallel, but not concurrent
– Generics map to C++ templates
– Structs map to C++ classes with no virtual methods, inlined in the containing object/array
– C++ compilation is mostly file-based, which limits cross-class optimization
– More deterministic execution (less runtime jitter)

Managed X10: base performance model is Java
– Object-oriented features are less expensive
  • JVMs do optimize virtual dispatch, interface invocation, instanceof, etc.
– Memory management
  • Very fast allocation
  • Highly tuned garbage collectors
– Generics map to Java generics
  • Mostly erased
  • Primitives boxed
– Structs map to Java classes (no difference between X10 structs and X10 classes)
– JVM dynamically optimizes the entire program
– Less deterministic execution (more jitter) due to JIT compilation, profile-directed optimization, GC

© 2012 IBM Corporation

Tutorial Outline

X10 Overview (70 minutes)
– Programming model concepts
– Key language features
– Implementation highlights

Scaling SPMD-style computations in X10 (40 minutes)
– Scaling finish/async
– Scaling communication
– Collective operations

Unbalanced computations (20 minutes)
– Global Load Balancing & UTS

Application frameworks (20 minutes)
– Global Matrix Library
– SAT X10

Summary and Discussion (20 minutes)

© 2012 IBM Corporation

Towards a Scalable HelloWholeWorld

Why won’t HelloWholeWorld scale?

class HelloWholeWorld {
  public static def main(args:Rail[String]) {
    finish
      for (p in Place.places())
        at (p)
          async
            Console.OUT.println(p + " says " + args(0));
  }
}

1. Using most general (default) implementation of finish

2. Sequential initiation of asyncs

3. Serializing more data than is actually necessary (minor improvement)

© 2012 IBM Corporation

Towards a Scalable HelloWholeWorld (2)

Minor change: optimize serialization to reduce message size

public static def main(args:Rail[String]) {
  finish for (p in Place.places()) {
    val arg = args(0);
    at (p) async Console.OUT.println("(At " + here + ") : " + arg);
  }
}

© 2012 IBM Corporation

Towards a Scalable HelloWholeWorld (3)

Optimize finish implementation via Pragma annotation

public static def main(args:Rail[String]) {
  @Pragma(Pragma.FINISH_SPMD) finish for (p in Place.places()) {
    val arg = args(0);
    at (p) async Console.OUT.println("(At " + here + ") : " + arg);
  }
}

© 2012 IBM Corporation

Towards a Scalable HelloWholeWorld (4)

Parallelize async initiation … this scales nicely, but it is getting complicated!

public static def main(args:Rail[String]) {
  val arg = args(0);   // as on the previous slides
  @Pragma(Pragma.FINISH_SPMD) finish
  for (var i:Int=Place.numPlaces()-1; i>=0; i-=32) {
    at (Place(i)) async {
      val max = Runtime.hereInt();
      val min = Math.max(max-31, 0);
      @Pragma(Pragma.FINISH_SPMD) finish
      for (var j:Int=min; j<=max; ++j) {
        at (Place(j)) async Console.OUT.println("(At " + here + ") : " + arg);
      }
    }
  }
}

© 2012 IBM Corporation

Finally: a Simple and Scalable HelloWholeWorld

Put the complexity in the standard library!

broadcastFlat encapsulates the chunked loop & pragma; use a closure (function) to get a clean API

Recurring theme in X10: simple primitives + strong abstraction = new powerful user-defined primitives

public static def main(args:Rail[String]) {
  val arg = args(0);
  PlaceGroup.WORLD.broadcastFlat(()=>{
    Console.OUT.println("(At " + here + ") : " + arg);
  });
}

© 2012 IBM Corporation

K-Means: description of the problem

Given:
– A set of points (black dots) in D-dimensional space (example: D = 2)
– Desired number of partitions K (example: K = 4)

Calculate:
– The K cluster locations (red blobs)
– Minimize the average distance to the assigned cluster

© 2012 IBM Corporation

Lloyd's Algorithm

Computes good (not best) clusters.

Initialize with a guess for the K positions.

Iteratively improve until we reach fixed point.

Algorithm:

while (!bored) {

classify points by nearest cluster

average equivalence classes for improved clusters

}
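The KMeans compute kernel shown later elides the point-assignment step ("sequential compute kernel elided"); the following sequential sketch fills it in. It is illustrative only: the names mirror the later slides, but a simple point-major layout points(p*dim+d) is assumed here, whereas the actual benchmark stores points dimension-major.

def assignAndAccumulate(points:Rail[Float], oldClusters:Rail[Float],
    newClusters:Rail[Float], counts:Rail[Int],
    numPoints:Int, numClusters:Int, dim:Int) {
  for (p in 0..(numPoints-1)) {
    // classify point p by its nearest old centroid
    var best:Int = -1;
    var bestDist:Float = 0;
    for (k in 0..(numClusters-1)) {
      var dist:Float = 0;
      for (d in 0..(dim-1)) {
        val diff = points(p*dim+d) - oldClusters(k*dim+d);
        dist += diff*diff;
      }
      if (best < 0 || dist < bestDist) { bestDist = dist; best = k; }
    }
    // accumulate point p into the new centroid for its cluster
    for (d in 0..(dim-1)) newClusters(best*dim+d) += points(p*dim+d);
    counts(best)++;
  }
}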


© 2012 IBM Corporation

Summary of Scalable KMeans in X10

Natural formulation as an SPMD-style computation
– Evenly distribute data points across Places
– Iteratively:
  • Each Place assigns its points to the closest cluster
  • Each Place computes new cluster centroids
  • Global reduction to average together the computed centroids

Use X10 Team APIs for scalable reductions

Complete benchmark: 150 lines of X10 code

© 2012 IBM Corporation

KMEANS

Goal: unsupervised clustering

Configuration
– 40000 * #tasks points
– 4096 clusters
– 12 dimensions
– run time for 5 iterations

Performance on 1470-node IBM Power 775
– 1 task: 6.13 s
– 1470 hosts: 6.27 s (97% efficiency)

© 2012 IBM Corporation

KMeans initialization

PlaceGroup.WORLD.broadcastFlat(()=>{
  val role = Runtime.hereInt();
  val random = new Random(role);
  val host_points = new Rail[Float](num_slice_points*dim, (Int)=>random.next());
  val host_clusters = new Rail[Float](num_clusters*dim);
  val host_cluster_counts = new Rail[Int](num_clusters);

  val team = Team.WORLD;

  if (role == 0)
    for (k in 0..(num_clusters-1))
      for (d in 0..(dim-1))
        host_clusters(k*dim+d) = host_points(k+d*num_slice_points);

  val old_clusters = new Rail[Float](num_clusters*dim);

  team.allreduce(role, host_clusters, 0, host_clusters, 0, host_clusters.size, Team.ADD);

  ... Main loop implementing Lloyd's algorithm (next slide) ...
});

© 2012 IBM Corporation

KMeans compute kernel

for (iter in 0..(iterations-1)) {
  Array.copy(host_clusters, old_clusters);
  host_clusters.clear();
  host_cluster_counts.clear();

  // Assign each point to the closest cluster
  for (p in ...) {
    ...sequential compute kernel elided...
  }

  // Compute new clusters
  team.allreduce(role, host_clusters, 0, host_clusters, 0,
                 host_clusters.size, Team.ADD);
  team.allreduce(role, host_cluster_counts, 0, host_cluster_counts, 0,
                 host_cluster_counts.size, Team.ADD);

  for (k in 0..(num_clusters-1))
    for (d in 0..(dim-1))
      host_clusters(k*dim+d) /= host_cluster_counts(k);

  team.barrier(role);
}

© 2012 IBM Corporation

LU Factorization

HPL-like algorithm

2-d block-cyclic data distribution

right-looking variant with row partial pivoting (no look ahead)

recursive panel factorization with pivot search and column broadcast combined

© 2012 IBM Corporation

Summary of Scalable LU in X10

SPMD-style implementation
– one top-level async per place (broadcastFlat)
– globally synchronous progress

Collectives: barriers, broadcast, reductions (row, column, WORLD)

point-to-point RDMAs for row swaps

uses BLAS for local computations

Complete Benchmark: 650 lines of X10 + 160 lines of C++ to wrap native BLAS

© 2012 IBM Corporation

LU

Goal: LU decomposition

Configuration
– 71% of memory (16M pages)
– p*p or 2p*p grid
– block size: 360; panel block size: 20

Performance on 1470-node IBM Power 775
– 1 task: 22.38 gflops
– 1 host: 20.62 gflops
– 1 drawer: 19.17 gflops
– 1024 hosts: 17.98 gflops (80% efficiency)

© 2012 IBM Corporation

Accessing Native code

import x10.compiler.*;

@NativeCPPInclude("essl_natives.h")
@NativeCPPCompilationUnit("essl_natives.cc")
class LU {
  @NativeCPPExtern
  native static def blockTriSolve(me:Rail[Double], diag:Rail[Double], B:Int):void;

  @NativeCPPExtern
  native static def blockTriSolveDiag(diag:Rail[Double], min:Int, max:Int, B:Int):void;

  ...
  // Use of blockTriSolveDiag:
  blockTriSolveDiag(diag, min, max, B);

© 2012 IBM Corporation

Allocation of Distributed BlockedArray (in Huge Pages)

public final class BlockedArray implements (Int,Int)=>Double {

  public def this(I:Int, J:Int, bx:Int, by:Int, rand:Random) {
    min_x = I*bx;
    min_y = J*by;
    max_x = (I+1)*bx-1;
    max_y = (J+1)*by-1;
    delta = by;
    offset = (I*bx+J)*by;
    raw = new Rail[Double](
      IndexedMemoryChunk.allocateZeroed[Double](bx*by, 8, IndexedMemoryChunk.hugePages()));
  }

  public static def make(M:Int, N:Int, bx:Int, by:Int, px:Int, py:Int) {
    return PlaceLocalHandle.makeFlat[BlockedArray](Dist.makeUnique(),
      ()=>new BlockedArray(M, N, bx, by, px, py));
  }
}

// In main program:
val A = BlockedArray.make(M, N, B, B, px, py);

© 2012 IBM Corporation

Row exchange via asynchronous RDMAs

def exchange(row1:Int, row2:Int, min:Int, max:Int, dest:Int) {
  val source = here;
  ready = false;
  val size = A_here.getRow(row1, min, max, buffer);
  val _buffers = buffers;            // use local variable to avoid serializing this!
  val _A = A;
  val _remoteBuffer = remoteBuffer;  // RemoteArray (GlobalRef) wrapping buffer
  @Pragma(Pragma.FINISH_ASYNC_AND_BACK) finish {
    at (Place(dest)) async {
      @Pragma(Pragma.FINISH_ASYNC) finish {
        Array.asyncCopy[Double](_remoteBuffer, 0, _buffers(), 0, size);
      }
      val size2 = _A().swapRow(row2, min, max, _buffers());
      Array.asyncCopy[Double](_buffers(), 0, _remoteBuffer, 0, size2);
    }
  }
  A_here.setRow(row1, min, max, buffer);
}

© 2012 IBM Corporation

Tutorial Outline

X10 Overview (70 minutes)
– Programming model concepts
– Key language features
– Implementation highlights

Scaling SPMD-style computations in X10 (40 minutes)
– Scaling finish/async
– Scaling communication
– Collective operations

Unbalanced computations (20 minutes)
– Global Load Balancing & UTS

Application frameworks (20 minutes)
– Global Matrix Library
– SAT X10

Summary and Discussion (20 minutes)

© 2012 IBM Corporation

Global Load Balancing

A significant class of compute-intensive problems can be cast as "Collecting Finish" (or "Map Reduce") problems (significantly simpler than Hadoop "Map Shuffle Reduce" problems)
– A parallel "Map" phase spawns work (may be driven by examining data in some shared global data structure)
– Each mapper may contribute 0 or more pieces of information to a "result"
– These pieces can be combined by a commutative, associative "reduction operator"
– Cf. collecting finish

Q: How do you efficiently execute such programs if the amount of work associated with each mapper could vary dramatically?
– Examples: state-space search, Unbalanced Tree Search (UTS), branch-and-bound search

© 2012 IBM Corporation

Global Load Balancing: Central Problem

Goal: efficiently (~90% efficiency) execute collecting-finish programs on a large number of nodes.

Problems to be solved:
– Balance work across a large number of places efficiently
– Detect termination of the computation quickly

Conventional approach:
– A worker steals from a randomly chosen victim when its local work is exhausted
– Key problem: when does a worker know there is no more work?

© 2012 IBM Corporation

Global Load Balance: The Lifeline Solution

Intuition: X10 already has a framework for distributed termination detection: finish. Use it!

But this requires that activities terminate at some point!

Idea: Let activities terminate after w rounds of failed steals, but ensure that a worker with work can "distribute" work (= use at (p) async S) to nodes known to be idle.
– Lifeline graphs: before a worker dies, it creates z lifelines to other nodes.
– These form a distribution "overlay" graph.
– What kind of graph? Need a low out-degree, low-diameter graph. Our paper/implementation uses a hypercube.

© 2012 IBM Corporation

Algorithm sketch for Lifeline-based GLB Framework

Main activity spawns an activity per place within a finish.

Augment work-stealing with work-distributing:
– Each activity processes its local work.
– When no work is left, it steals at most w times.
– (Thief) If no work is found, it visits at most z lifelines in turn.
  • The donor records that the thief needs work.
– (Thief) If no work is found, it dies.
– (Donor) Pushes work to recorded thieves when it has some.

When all activities die, the main activity continues to completion.

Caveat: this scheme assumes that work is relocatable. A compact sketch of the resulting worker loop follows.
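The sketch below distills the worker loop just described. It is purely illustrative: the helper methods are made-up stubs, and the real UTS code on the following slides is the authoritative version.

class GlbWorkerSketch {
  static val w = 2;                               // random steal attempts (illustrative)
  static val z = 1;                               // lifelines (illustrative)
  static def hasLocalWork():Boolean = false;      // stubs standing in for the real queue,
  static def processSomeWork() {}                 // steal, and lifeline machinery
  static def pushWorkToRecordedThieves() {}
  static def tryRandomSteal():Boolean = false;
  static def tryLifeline(i:Int):Boolean = false;

  static def workerLoop() {
    while (true) {
      while (hasLocalWork()) {
        processSomeWork();
        pushWorkToRecordedThieves();              // work-distributing along activated lifelines
      }
      var found:Boolean = false;
      for (i in 1..w)                             // at most w random steal attempts
        if (tryRandomSteal()) { found = true; break; }
      if (!found)
        for (i in 1..z)                           // then at most z lifelines
          if (tryLifeline(i)) { found = true; break; }
      if (!found) return;                         // die; the enclosing finish detects termination
    }
  }
}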

© 2012 IBM Corporation

UTS: Unbalanced Tree Search

Characteristics
– highly irregular => need for global load balancing
– geometric fixed law (-t 1 -a 3) => wide but shallow tree

Implementation
– work queue of SHA1 hash of node + depth + interval of child ids
– global work stealing
  • n random victims (among m possible targets), then p lifelines (hypercube)
  • thief steals half of the work (half of each node in the local work queue)
  • finish counts lifeline steals only (25% performance improvement at scale)
– single thread (thief and victim cooperate)

Complete benchmark: 560 lines of X10 (+ 800 lines of sequential C for the SHA1 hash)

© 2012 IBM Corporation

UTS

Goal: unbalanced tree search

Configuration
– geometric fixed law; depth between 14 and 22
– at scale (1470 hosts): -t 1 -a 3 -b 4 -r 19 -d 22; 69,312,400,211,958 nodes

Performance on 1470-node IBM Power 775
– 1 task: 10.9 M nodes/s
– 1470 hosts: 505 B nodes/s (98% efficiency)

© 2012 IBM Corporation

UTS Main Loop: processStack

@Inline final def processAtMostN() {
  var i:Int=0;
  for (; (i<n) && (queue.size>0); ++i) {
    queue.expand();
  }
  queue.count += i;
  return queue.size > 0;
}

final def processStack(st:PlaceLocalHandle[UTS]) {
  do {
    while (processAtMostN()) {
      Runtime.probe();
      distribute(st);
    }
    reject(st);
  } while (steal(st));
}

© 2012 IBM Corporation

UTS: steal

def steal(st:PlaceLocalHandle[UTS]) {
  if (P == 1) return false;
  val p = Runtime.hereInt();
  empty = true;

  for (var i:Int=0; i < w && empty; ++i) {
    waiting = true;
    at (Place(victims(random.nextInt(m))))
      @Uncounted async st().request(st, p, false);
    while (waiting) Runtime.probe();
  }

  for (var i:Int=0; (i<lifelines.size) && empty && (0<=lifelines(i)); ++i) {
    val lifeline = lifelines(i);
    if (!lifelinesActivated(lifeline)) {
      lifelinesActivated(lifeline) = true;
      waiting = true;
      at (Place(lifeline))
        @Uncounted async st().request(st, p, true);
      while (waiting) Runtime.probe();
    }
  }

  return !empty;
}

© 2012 IBM Corporation

UTS: give

def give(st:PlaceLocalHandle[UTS], loot:Queue.Fragment) {
  val victim = Runtime.hereInt();
  if (thieves.size() > 0) {
    val thief = thieves.pop();
    if (thief >= 0) {
      at (Place(thief)) @Uncounted async {
        st().deal(st, loot, victim);
        st().waiting = false;
      }
    } else {
      at (Place(-thief-1)) @Uncounted async {
        st().deal(st, loot, -1);
        st().waiting = false;
      }
    }
  } else {
    val thief = lifelineThieves.pop();
    at (Place(thief)) async st().deal(st, loot, victim);
  }
}

© 2012 IBM Corporation

UTS: distribute & reject

def distribute(st:PlaceLocalHandle[UTS]) {
  var loot:Queue.Fragment;
  while ((lifelineThieves.size() + thieves.size() > 0) &&
         (loot = queue.grab()) != null) {
    give(st, loot);
  }
  reject(st);
}

def reject(st:PlaceLocalHandle[UTS]) {
  while (thieves.size() > 0) {
    val thief = thieves.pop();
    if (thief >= 0) {
      lifelineThieves.push(thief);
      at (Place(thief)) @Uncounted async { st().waiting = false; }
    } else {
      at (Place(-thief-1)) @Uncounted async { st().waiting = false; }
    }
  }
}

© 2012 IBM Corporation

UTS: deal

def deal(st:PlaceLocalHandle[UTS], loot:Queue.Fragment, source:Int) {
  val lifeline = source >= 0;
  if (lifeline) lifelinesActivated(source) = false;
  if (active) {
    empty = false;
    processLoot(loot, lifeline);
  } else {
    active = true;
    processLoot(loot, lifeline);
    processStack(st);
    active = false;
  }
}

© 2012 IBM Corporation

Tutorial Outline

X10 Overview (70 minutes)
– Programming model concepts
– Key language features
– Implementation highlights

Scaling SPMD-style computations in X10 (40 minutes)
– Scaling finish/async
– Scaling communication
– Collective operations

Unbalanced computations (20 minutes)
– Global Load Balancing & UTS

Application frameworks (20 minutes)
– Global Matrix Library
– SAT X10

Summary and Discussion (20 minutes)

© 2012 IBM Corporation

X10 Global Matrix Library

Easy-to-use matrix library in X10 supporting parallel execution on multiple places
– Presents a standard sequential interface (internal parallelism and distribution)

Matrix data structures:
– Dense matrix, symmetric and triangular matrix: column-wise storage
– Sparse matrix: CSC-LT
– Vector, distributed and duplicated vector
– Block matrix, distributed and duplicated block matrix

Matrix operations
– Scaling; cell-wise add, subtract, multiply, and divide
– Matrix multiplication
– Norm, trace, linear equation solver

Approximately 45,000 lines of X10 code (open source)

© 2012 IBM Corporation

Global Matrix Library

[Figure: GML software stack. Application-facing X10 APIs (dense matrix, sparse CSC matrix, vector, block matrix; distributed and duplicated block matrices) sit above dense and sparse matrix X10 drivers, BLAS wrappers (multi-threaded GotoBLAS, LAPACK, 3rd-party C-MPI libraries), and Team collectives. A native C/C++ back end runs over MPI, socket, PAMI, and PGAS transports; a managed Java back end runs over sockets.]

© 2012 IBM Corporation

Matrix Partitioning and Distribution

Grid partitioning

Balanced partitioning

User-specified partitioning

Block matrix

Dense, sparse, or mixed blocks

Block distribution: block - place mapping

Unique, constant, cyclic, grid, vertical, horizontal and random

[Figure: a matrix partitioned into blocks B0 .. Bm*n-1 in an m x n grid, with blocks mapped to places P0 .. P3 according to the chosen distribution.]


© 2012 IBM Corporation

GML Example: PageRank

X10 version:

def step(G:…, P:…, GP:…, dupGP:…, UP:…, E:…) {
  for (1..iteration) {
    GP.mult(G,P)
      .scale(alpha)
      .copyInto(dupGP);              // broadcast
    P.local()
      .mult(E, UP.mult(U,P.local()))
      .scale(1-alpha)
      .cellAdd(dupGP.local())
      .sync();                       // broadcast
  }
}

DML version:

while (I < max_iteration) {
  p = alpha*(G%*%p) + (1-alpha)*(e%*%u%*%p);
}

Programmer determines the distribution of the data.
– G is block-distributed per row
– P is replicated

Programmer writes sequential, distribution-aware code.

Perform the usual optimizations (e.g., allocate big data structures outside the loop).

Could equally well have been written in Java.

© 2012 IBM Corporation

PageRank Performance

Triloka x86 cluster (TK)
– Managed X10 on socket transport
– Native X10 on socket transport

Watson4P (BlueGene/P)
– Native X10 on BlueGene DCMF transport

[Figure: PageRank run time (sec). Left panel: 64 places, G nonzero sparsity 0.001, time vs. nonzeros in G (10 to 1000 million). Right panel: 1000 million nonzeros in G, time vs. number of places (16 to 64). Series: Java-socket TK, Native-socket TK, Native-pgas BGP.]

© 2012 IBM Corporation

GML Example: Gaussian Non-Negative Matrix Multiplication

Key kernel for topic modeling

Involves factoring a large (D x W) matrix

D ~ 100M

W ~ 100K, but sparse (0.001)

Iterative algorithm, involves distributed sparse matrix multiplication, cell-wise matrix operations.

for (1..iteration) {
  H.cellMult(WV
     .transMult(W,V,tW)
     .cellDiv(WWH
        .mult(WW.transMult(W,W),H)));
  W.cellMult(VH
     .multTrans(V,H)
     .cellDiv(WHH
        .mult(W,HH.multTrans(H,H))));
}

Key decision is the representation for each matrix, and its distribution. Note: the application code is polymorphic in this choice.

[Figure: distribution of V, W, H, and HH across places P0, P1, P2, ..., Pn.]

© 2012 IBM Corporation

GNMF Performance

Test platforms:
– Triloka Cluster (TK): 16 Quad-Core AMD Opteron 2356 nodes, 1-Gb/s Ethernet
– Watson4P BlueGene/P system (BGP): 2 cards, 64 nodes

Test settings: run on 64 places
– V: sparsity 0.001, (1,000,000 x 100,000) ~ (10,000,000 x 1,000,000)
– W: dense matrix (1,000,000 x 10) ~ (10,000,000 x 10)
– H: dense matrix (10 x 1,000,000)

[Figure: GNMF runtime and computation time (sec) vs. nonzeros in V (100 to 1000 million), 64 places. Series: Java-socket TK, Native-socket TK, Native-pgas BGP.]

© 2012 IBM Corporation

Global Matrix Library Summary

Easy to use
– Simplified matrix operation interface
– Managed back end simplifies development, while the native back end provides high-performance computing capability
– Hides concurrency and distribution details from application developers

Hierarchical parallel computing capability
– Within a node: multi-threaded BLAS/LAPACK for intensive computation
– Between nodes: collective communication for coarse-grained parallel computing

Adaptability
– Data layout compatible with existing specifications (BLAS, CSC-LT)

Portability
– Supports MPI, socket, PAMI, and BlueGene transports

Availability
– Open source; x10.gml project in the X10 project's main svn on SourceForge.net

© 2012 IBM Corporation

Sat X10

Framework to combine existing sequential SAT solvers into a parallel portfolio solver
– Diversity: small number of base solvers, but different parameterizations
– Knowledge sharing: exchange learned clauses between solvers

Structure of SatX10
– Minimal modifications to existing solvers (add callbacks for knowledge sharing)
– Small X10/C++ interoperability wrapper for each solver
– Small (100s of lines) X10 program for communication/distribution

Allows the parallel solver to run on a single machine with multiple cores and across multiple machines, sharing information such as learned clauses

More information: http://x10-lang.org/satx10
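A hedged sketch of the portfolio pattern SatX10 implements. This is not the actual SatX10 code: solveLocally is a made-up stand-in for the wrapped C++ solvers, and the real framework additionally exchanges learned clauses and kills the remaining places once one of them finds an answer.

class PortfolioSketch {
  // stand-in for a call into a wrapped Minisat/Glucose instance
  static def solveLocally(seed:Int):Boolean {
    return new x10.util.Random(seed as Long).nextBoolean();
  }

  public static def main(args:Rail[String]) {
    finish for (p in Place.places()) at (p) async {
      val sat = solveLocally(here.id);   // one differently-parameterized solver per place
      Console.OUT.println(here + " reports " + (sat ? "SAT" : "UNSAT"));
    }
  }
}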

© 2012 IBM Corporation

SATX10 Architecture

[Figure: SatX10 Framework 1.0 architecture. A C++ base class, SolverSatX10Base, holds shared data objects (Callback pointer, placeID, maxLenShrCl, outBufSize, incomingClsQ, outgoingClsQ), pure virtual methods each solver must implement (x10_parseDimacs(), x10_nVars(), x10_nClauses(), x10_solve(), x10_printSoln(), x10_printStats()), and other controls (x10_kill(), x10_wasKilled(), x10_accessIncBuf(), x10_accessOutBuf()). A SolverX10Callback interface provides x10_step() and x10_processOutgoingCls(). SatX10__Solver implements the callbacks of the base class, collects CallbackStats, and exposes routines for X10 to interact with solvers (solve(), kill(), bufferIncomingCls(), printInstanceInfo(), printResults()). Individual solvers (Minisat::Solver, Glucose::Solver) implement the pure virtual methods plus bufferOutgoingCls() and processIncomgCls(), with per-solver specializations SatX10__Minisat and SatX10__Glucose. SatX10.x10 contains the main X10 routines that launch solvers at the various places.]

© 2012 IBM Corporation

Preliminary Empirical Results: Same Machine

Same machine, 8 cores, clause lengths = 1 and 8

Note: promising but preliminary results; the focus so far has been on developing the framework, not on producing a highly competitive solver.

© 2012 IBM Corporation

Preliminary Empirical Results: Multiple Machines

8 machines / 8 cores vs. 16 machines / 64 cores, clause lengths = 1 and 8

Same executable as for a single machine, just different parameters!

© 2012 IBM Corporation

Tutorial Outline

X10 Overview (70 minutes)
– Programming model concepts
– Key language features
– Implementation highlights

Scaling SPMD-style computations in X10 (40 minutes)
– Scaling finish/async
– Scaling communication
– Collective operations

Unbalanced computations (20 minutes)
– Global Load Balancing & UTS

Application frameworks (20 minutes)
– Global Matrix Library
– SAT X10

Summary and Discussion (20 minutes)

© 2012 IBM Corporation

X10 Project Highlights since SC’11

Three major releases: 2.2.2, 2.2.3, 2.3.0
– Maintained backwards compatibility with the X10 2.2.0 language (June 2011)
– Language backwards compatibility with 2.2.0 will be maintained in future releases

Smooth Java interoperability
– Major feature of the X10 2.3.0 release

Application work at IBM
– M3R: Main Memory Map Reduce (VLDB'12)
– Global Matrix Library (open sourced Oct 2011; available in X10 svn)
– SatX10 (SAT'12)
– HPC benchmarks (for PERCS) run at scale on a 47,000-core Power 775

© 2012 IBM Corporation

Summary of X10’12 Workshop

Second annual X10 workshop (co-located with PLDI/ECOOP in June 2012)

Papers
1. Fast Method Dispatch and Effective Use of Primitives for Reified Generics in Managed X10
2. Distributed Garbage Collection for Managed X10
3. StreamX10: A Stream Programming Framework on X10
4. X10-based Massive Parallel Large-Scale Traffic Flow Simulation
5. Characterization of Smith-Waterman Sequence Database Search in X10
6. Introducing ScaleGraph: An X10 Library for Billion Scale Graph Analytics

Invited talks
1. A tutorial on X10 and its implementation
2. M3R: Increased Performance for In-Memory Hadoop Jobs

X10'13 will be held in June 2013, co-located with PLDI'13

© 2012 IBM Corporation

X10 Community: StreamX10

Slide by Haitao Wei, X10 2012

© 2012 IBM Corporation

X10 Community: Large-Scale Traffic Flow Simulation

Slide by Toyotaro Suzumura, X10 2012

© 2012 IBM Corporation

X10 Community: ScaleGraph

Slide by Miyuru Dayarathna, X10 2012

© 2012 IBM Corporation

X10 Community: M3R – Main Memory Map Reduce (VLDB’12)

[Figure: M3R architecture. Unmodified Java Hadoop jobs can run either on the Hadoop Map Reduce Engine over HDFS data or, via an M3R/Hadoop adaptor, on the M3R Engine (written in X10 and Java), which caches HDFS data in memory across multiple jobs; X10 and Java M3R jobs can also target the M3R Engine directly.]

Significant performance improvements for unmodified Hadoop 1.0 Map/Reduce jobs

© 2012 IBM Corporation

X10 Community: ANUChem

• Computational chemistry codes written in X10
– FMM: electrostatic calculation using the Fast Multipole Method
– PME: electrostatic calculation using the Smooth Particle Mesh Ewald method
– Pumja Rasaayani: energy calculation using the Hartree-Fock SCF method

• Developed by Josh Milthorpe and Ganesh Venkateshwara at ANU
– http://cs.anu.edu.au/~Josh.Milthorpe/anuchem.html
– Open source (EPL); tracks X10 releases
– Around 16,000 lines of X10 code
– Overview and initial results in [Milthorpe et al., IPDPS 2011]

• Code under active development; significant performance and scalability improvements in the current ANUChem 2.2.2 vs. the results in the IPDPS'11 paper

© 2012 IBM Corporation

Final Thoughts

Give X10 a try
– Language definition is stable
– Toolchain is not production-quality, but good enough for significant application work
– Active X10 research & user community

Future X10 plans
– XStack: DSL Technologies for Exascale Computing (D-TEC)
  • Building on the X10 runtime system
  • Increase interoperability of X10 and C+MPI+OpenMP (migration path)
– Mixed-mode execution
  • Mix Native X10 and Managed X10 places within a single program execution
– For details, see the X10 Roadmap at http://x10-lang.org/home/codebase-progress.html

Questions & Discussion

© 2012 IBM Corporation

X10 References

main project web site: http://x10-lang.org

publication list: http://x10-lang.org/x10-community/publications-using-x10.html

language spec: http://x10.sourceforge.net/documentation/languagespec/x10-latest.pdf

releases: http://x10-lang.org/software/download-x10/release-list.html

this tutorial: http://x10-lang.org/tutorials/sc2012