Titanium, CS264, K. Yelick
Compiling for Parallel Machines
CS264
Kathy Yelick
Two General Research Goals
• Correctness: help programmers eliminate bugs
  – Analysis to detect bugs statically (and conservatively)
  – Tools such as debuggers to help detect bugs dynamically
• Performance: help make programs run faster
  – Static compiler optimizations
    » May use analyses similar to the above to ensure the compiler transforms code correctly
    » In many areas, the open problem is determining which transformations should be applied, and when
  – Link- or load-time optimizations, including object code translation
  – Feedback-directed optimization
  – Runtime optimization
• For parallel machines, if you can’t get good performance, what’s the point?
A Little History
• Most research on compiling for parallel machines has been on
  – automatic parallelization of serial code
  – loop-level parallelization (usually Fortran)
• Most parallel programs are written using explicit parallelism, either:
  A) Message passing with a single program, multiple data (SPMD) model, usually MPI with either Fortran or mixed C++ and Fortran, for scientific applications
  B) Shared memory with a thread and synchronization library in C or Java, for non-scientific applications
• Option B is easier to program, but requires hardware support that is still unproven for more than 200 processors
Titanium Overview
• Give programmers a global address space
  – Useful for building large, complex data structures that are spread over the machine
  – But don’t pretend it has uniform access time (i.e., not quite shared memory)
• Use an explicit parallelism model
  – SPMD for simplicity
• Extend a “standard” language with data structures for a specific problem domain: grid-based scientific applications
  – Small amount of syntax added for ease of programming
  – General idea: build domain-specific features into the language and optimization framework
Titanium Goals
• Performance
  – close to C/Fortran + MPI, or better
• Portability
  – develop on a uniprocessor, then an SMP, then an MPP/cluster
• Safety
  – as safe as Java, extended to the parallel framework
• Expressiveness
  – close to the usability of threads
  – add a minimal set of features
• Compatibility, interoperability, etc.
  – no gratuitous departures from the Java standard
Titanium
• Take the best features of threads and MPI
  – global address space like threads (for ease of programming)
  – SPMD parallelism like MPI (for performance)
  – local/global distinction, i.e., layout matters (for performance)
• Based on Java, a cleaner C++
  – classes, memory management
• Language is extensible through classes
  – domain-specific language extensions
  – current support for grid-based computations, including AMR
• Optimizing compiler
  – communication and memory optimizations
  – synchronization analysis
  – cache and other uniprocessor optimizations
New Language Features
• Scalable parallelism
  – SPMD model of execution with global address space
• Multidimensional arrays
  – points and index sets as first-class values to simplify programs
  – iterators for performance
• Checked synchronization
  – single-valued variables and globally executed methods
• Global communication library
• Immutable classes
  – user-definable non-reference types for performance
• Operator overloading
  – by demand from our user community
• Semi-automated zone-based memory management
  – as safe as a garbage-collected language
  – better parallel performance and scalability
Lecture Outline
• Language and compiler support for uniprocessor performance
  – Immutable classes
  – Multidimensional arrays
  – foreach
• Language support for parallel computation
• Analysis of parallel code
• Summary and future directions
Java: A Cleaner C++
• Java is an object-oriented language
  – classes (no standalone functions) with methods
  – inheritance between classes; multiple inheritance of interfaces only
• Documentation on the web at java.sun.com
• Syntax similar to C++:

    class Hello {
      public static void main (String [] argv) {
        System.out.println("Hello, world!");
      }
    }

• Safe
  – Strongly typed: checked at compile time, no unsafe casts
  – Automatic memory management
• Titanium is an (almost) strict superset
Java Objects
• Primitive scalar types: boolean, double, int, etc.
  – implementations will store these on the program stack
  – access is fast, comparable to other languages
• Objects: user-defined and from the standard library
  – passed by pointer value (object sharing) into functions
  – have an implicit level of indirection (pointer to the object)
  – simple model, but inefficient for small objects
[Diagram: primitive values (2.6, 3, true) stored directly on the stack; a Complex object (r: 7.1, i: 4.3) reached through a pointer]
Java Object Example

    class Complex {
      private double real;
      private double imag;
      public Complex(double r, double i) { real = r; imag = i; }
      public Complex add(Complex c) {
        return new Complex(c.real + real, c.imag + imag);
      }
      public double getReal() { return real; }
      public double getImag() { return imag; }
    }

    Complex c = new Complex(7.1, 4.3);
    c = c.add(c);

    class VisComplex extends Complex { ... }
Immutable Classes in Titanium
• For small objects, we would sometimes prefer
  – to avoid the level of indirection
  – pass by value (copying of the entire object)
  – especially when objects are immutable, i.e., fields are unchangeable
    » extends the idea of primitive values (1, 4.2, etc.) to user-defined values
• Titanium introduces immutable classes
  – all fields are final (implicitly)
  – cannot inherit from (extend) or be inherited by other classes
  – must have a 0-argument constructor, e.g., Complex()

    immutable class Complex { ... }
    Complex c = new Complex(7.1, 4.3);
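Titanium's `immutable` keyword is not legal Java, but its semantics can be approximated in plain Java with a final class and final fields; a minimal sketch (the class layout here is our illustration, not the Titanium runtime):

```java
// Plain-Java approximation of a Titanium immutable class: the class is
// final (cannot be extended), all fields are final, and a 0-argument
// constructor exists as Titanium requires.
final class Complex {
    public final double real;
    public final double imag;
    public Complex() { this(0.0, 0.0); }  // required 0-argument constructor
    public Complex(double r, double i) { real = r; imag = i; }
    public Complex add(Complex c) {
        return new Complex(real + c.real, imag + c.imag);
    }
}

public class ImmutableDemo {
    public static void main(String[] args) {
        Complex c = new Complex(7.1, 4.3);
        c = c.add(c);  // produces a new value; no field is ever mutated
        System.out.println(c.real + " " + c.imag);  // 14.2 8.6
    }
}
```

Java still allocates such objects on the heap, which is exactly the indirection cost that Titanium's true immutable (by-value) classes avoid.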
Arrays in Java
• Arrays in Java are objects
• Only 1D arrays are directly supported
• Array bounds are checked
• Multidimensional arrays as arrays-of-arrays are slow
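The cost is visible in plain Java: a double[][] is an array of independent row objects, so rows may be ragged, aliased, and non-contiguous, and each element access pays two indirections plus two bounds checks. A small illustration:

```java
// Java's "2D" arrays are arrays of row pointers: each row is a separate
// heap object, so nothing guarantees rows are contiguous, equal-length,
// or even distinct.
public class ArraysDemo {
    public static void main(String[] args) {
        double[][] a = new double[3][];
        a[0] = new double[5];
        a[1] = new double[2];  // rows can have different lengths
        a[2] = a[0];           // ...or alias each other
        System.out.println(a[1].length);  // 2
    }
}
```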
Multidimensional Arrays in Titanium
• New kind of multidimensional array added
  – Two arrays may overlap (unlike Java arrays)
  – Indexed by Points (tuples of ints)
  – Constructed over a set of Points, called a Domain
  – RectDomains are a special case of Domains
  – Points, Domains, and RectDomains are built-in immutable classes
• Support for adaptive meshes and other mesh/grid operations

    RectDomain<2> d = [0:n, 0:n];
    Point<2> p = [1, 2];
    double [2d] a = new double [d];
    a[0,0] = a[9,9];
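Points and RectDomains are Titanium built-ins; as a rough illustration only, a plain-Java mock of a 2-D rectangular domain (unit stride; all names here are hypothetical, not the Titanium runtime) might look like:

```java
import java.util.*;

// Hypothetical plain-Java mock of Titanium's Point<2> / RectDomain<2>:
// a RectDomain is just the set of integer points between its bounds.
public class RectDomainDemo {
    record Point2(int x, int y) {}
    record RectDomain2(Point2 lb, Point2 ub) {
        List<Point2> points() {  // enumerate [lb.x:ub.x, lb.y:ub.y]
            List<Point2> ps = new ArrayList<>();
            for (int i = lb.x(); i <= ub.x(); i++)
                for (int j = lb.y(); j <= ub.y(); j++)
                    ps.add(new Point2(i, j));
            return ps;
        }
    }
    public static void main(String[] args) {
        RectDomain2 d = new RectDomain2(new Point2(0, 0), new Point2(1, 2));
        System.out.println(d.points().size());  // 6 points in a 2x3 domain
    }
}
```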
Naïve MatMul with Titanium Arrays

    public static void matMul(double [2d] a,
                              double [2d] b,
                              double [2d] c) {
      int n = c.domain().max()[1];  // assumes square
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
          for (int k = 0; k < n; k++) {
            c[i,j] += a[i,k] * b[k,j];
          }
        }
      }
    }
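For comparison, the same naïve triple loop over ordinary Java arrays (square matrices assumed, as in the slide):

```java
// Plain-Java version of the naive triple loop above, for comparison
// with the Titanium code. Assumes square n-by-n matrices.
public class MatMul {
    static void matMul(double[][] a, double[][] b, double[][] c) {
        int n = c.length;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }
    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = new double[2][2];
        matMul(a, b, c);
        System.out.println(c[0][0] + " " + c[1][1]);  // 19.0 50.0
    }
}
```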
Two Performance Issues
• In any language, uniprocessor performance is often dominated by memory hierarchy costs
  – algorithms that are “blocked” for the memory hierarchy (caches and registers) can be much faster
• In Titanium, the representation of arrays is fast, but the access methods are expensive
  – need optimizations on Titanium arrays
    » common subexpression elimination
    » eliminate (or hoist) bounds checking
    » strength reduction: e.g., naïve code has one divide per dimension for each array access
• See Geoff Pike’s work
  – goal: competitive with C/Fortran performance, or better
Matrix Multiply (blocked, or tiled)
Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize.

    for i = 1 to N
      for j = 1 to N
        {read block C(i,j) into fast memory}
        for k = 1 to N
          {read block A(i,k) into fast memory}
          {read block B(k,j) into fast memory}
          C(i,j) = C(i,j) + A(i,k) * B(k,j)  {do a matrix multiply on blocks}
        {write block C(i,j) back to slow memory}

[Diagram: C(i,j) = C(i,j) + A(i,k) * B(k,j), one block of each matrix]
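The tiled pseudocode above can be sketched in plain Java; the blocksize parameter bs is a tuning knob, and the helper name is ours, not from the slides:

```java
// Sketch of the blocked (tiled) multiply: iterate over bs-by-bs blocks
// so each block of A, B, and C stays in cache while it is being reused.
public class BlockedMatMul {
    static void matMulBlocked(double[][] a, double[][] b, double[][] c, int bs) {
        int n = c.length;
        for (int i0 = 0; i0 < n; i0 += bs)
            for (int j0 = 0; j0 < n; j0 += bs)
                for (int k0 = 0; k0 < n; k0 += bs)
                    // multiply the (i0,k0) block of A by the (k0,j0) block of B
                    for (int i = i0; i < Math.min(i0 + bs, n); i++)
                        for (int j = j0; j < Math.min(j0 + bs, n); j++) {
                            double s = c[i][j];
                            for (int k = k0; k < Math.min(k0 + bs, n); k++)
                                s += a[i][k] * b[k][j];
                            c[i][j] = s;
                        }
    }
    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = new double[2][2];
        matMulBlocked(a, b, c, 1);
        System.out.println(c[0][0] + " " + c[1][1]);  // 19.0 50.0
    }
}
```

Any blocksize gives the same result; only the traversal order (and hence cache behavior) changes.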
Memory Hierarchy Optimizations: MatMul
[Chart: speed of n-by-n matrix multiply on a Sun Ultra-1/170; peak = 330 MFlops]
Unordered iteration
• Often useful to reorder iterations for caches
• Compilers can do this for simple operations, e.g., matrix multiply, but it is hard in general
• Titanium adds unordered iteration over rectangular domains:

    foreach (p within r) { ... }

  – p is a new Point, scoped only within the foreach body
  – r is a previously declared RectDomain
• foreach simplifies bounds checking as well
• Additional operations on domains and arrays to subset and transform
Better MatMul with Titanium Arrays

    public static void matMul(double [2d] a,
                              double [2d] b,
                              double [2d] c) {
      foreach (ij within c.domain()) {
        double [1d] aRowi = a.slice(1, ij[1]);
        double [1d] bColj = b.slice(2, ij[2]);
        foreach (k within aRowi.domain()) {
          c[ij] += aRowi[k] * bColj[k];
        }
      }
    }

The current compiler eliminates array overhead, making performance comparable to C for three nested loops; automatic tiling is still TBD.
Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for parallel computation
  – SPMD execution
  – Global and local references
  – Communication
  – Barriers and single
  – Synchronized methods and blocks (as in Java)
• Analysis of parallel code
• Summary and future directions
SPMD Execution Model
• Java programs can be run as Titanium programs, but the result will be that all processors do all the work
• E.g., parallel hello world:

    class HelloWorld {
      public static void main (String [] argv) {
        System.out.println("Hello from proc " + Ti.thisProc());
      }
    }

• Any non-trivial program will have communication and synchronization between processors
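The SPMD style can be mimicked in plain Java threads, with a thread id standing in for Ti.thisProc() (a rough analogy, not the Titanium runtime; the processor count is an assumption for the demo):

```java
// SPMD "hello" in plain Java threads: every thread runs the same body,
// distinguished only by its processor id.
public class SpmdHello {
    static final int P = 4;  // number of "processors" (assumed for this demo)
    static String greeting(int id) { return "Hello from proc " + id; }
    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[P];
        for (int p = 0; p < P; p++) {
            final int id = p;
            ts[p] = new Thread(() -> System.out.println(greeting(id)));
            ts[p].start();
        }
        for (Thread t : ts) t.join();  // wait for all "processors" to finish
    }
}
```

The output lines appear in arbitrary order, which is the point: SPMD threads execute the same code but not in lock-step.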
SPMD Execution Model
• A common style is compute/communicate
• E.g., in each timestep within a fish simulation with gravitational attraction:

    read all fish and compute forces on mine
    Ti.barrier();
    write to my fish using new forces
    Ti.barrier();
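The compute/communicate timestep above maps naturally onto java.util.concurrent's CyclicBarrier when sketched in plain Java (the "force" arithmetic here is a made-up toy, not the course's fish simulation):

```java
import java.util.concurrent.CyclicBarrier;

// Compute/communicate in plain Java: each "processor" reads everyone's
// state, waits at a barrier, then writes only its own state, and waits
// again before the next timestep.
public class ComputeCommunicate {
    static double[] simulate(int steps) throws Exception {
        int procs = 2;
        double[] force = new double[procs];
        double[] pos = {1.0, 1.0};
        CyclicBarrier barrier = new CyclicBarrier(procs);
        Thread[] ts = new Thread[procs];
        for (int p = 0; p < procs; p++) {
            final int me = p;
            ts[p] = new Thread(() -> {
                try {
                    for (int s = 0; s < steps; s++) {
                        force[me] = pos[0] + pos[1];  // read all, compute mine
                        barrier.await();              // like Ti.barrier()
                        pos[me] += force[me];         // write only my state
                        barrier.await();              // like Ti.barrier()
                    }
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            ts[p].start();
        }
        for (Thread t : ts) t.join();
        return pos;
    }
    public static void main(String[] args) throws Exception {
        double[] pos = simulate(3);
        System.out.println(pos[0] + " " + pos[1]);  // 27.0 27.0
    }
}
```

The barriers separate the read phase from the write phase, so no thread reads a position while another is updating it.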
SPMD Model
• All processors start together and execute the same code, but not in lock-step
• Sometimes they take different branches:

    if (Ti.thisProc() == 0) { ... do setup ... }
    for (all data I own) { ... compute on data ... }

• A common source of bugs is barriers or other global operations inside branches or loops
  – barrier, broadcast, reduction, exchange
• A “single” method is one called by all procs:
    public single static void allStep(...)
• A “single” variable has the same value on all procs:
    int single timestep = 0;
SPMD Execution Model
• Barriers and single in FishSimulation (n-body):

    class FishSim {
      public static void main (String [] argv) {  // single
        int allTimestep = 0;                      // single
        int allEndTime = 100;                     // single
        for (; allTimestep < allEndTime; allTimestep++) {
          read all fish and compute forces on mine
          Ti.barrier();
          write to my fish using new forces
          Ti.barrier();
        }
      }
    }

• Single methods are inferred; see David Gay’s work
Global Address Space
• Processes allocate locally
• References can be passed to other processes

    class C { int val; ... }
    C gv;        // global pointer
    C local lv;  // local pointer
    if (thisProc() == 0) {
      lv = new C();
    }
    gv = broadcast lv from 0;
    gv.val = ...;   // full
    ... = gv.val;   // functionality

[Diagram: each process has a local heap; after the broadcast, gv on every process points to the object allocated in process 0’s heap, while only process 0’s lv refers to it locally]
Use of Global / Local
• Default is global
  – easier to port shared-memory programs
  – performance bugs are common: global pointers are more expensive
  – harder to use sequential kernels
• Use local declarations in critical sections
• Compiler can infer many instances of “local”
• See Liblit’s work on LQI (Local Qualification Inference)
Local Pointer Analysis [Liblit, Aiken]
• Global references simplify programming, but incur overhead even when data is local
  – Split-C therefore requires global pointers to be declared explicitly
  – Titanium pointers are global by default: easier, better portability
• Automatic “local qualification” inference

[Chart: effect of LQI on running time (sec) for the cannon, lu, sample, gsrb, and poisson applications, original vs. after LQI]
Parallel Performance
• Speedup on an UltraSPARC SMP
• AMR largely limited by
  – the current algorithm
  – problem size
  – 2 levels, with the top one serial
• Not yet optimized with “local” for distributed memory

[Chart: speedup of em3d and amr on 1, 2, 4, and 8 processors]
Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for parallel computation
• Analysis and optimization of parallel code
  – Tolerating network latency: the Split-C experience
  – Hardware trends and reordering
  – Semantics: sequential consistency
  – Cycle detection: parallel dependence analysis
  – Synchronization analysis: parallel flow analysis
• Summary and future directions
Split-C Experience: Latency Overlap
• Titanium borrowed ideas from Split-C
  – global address space
  – SPMD parallelism
• But Split-C had non-blocking accesses built in, to tolerate network latency on remote read/write
• Also one-way communication
• Conclusion: useful, but complicated

    int *global p;
    x := *p;          /* get */
    *p := 3;          /* put */
    sync;             /* wait for my puts/gets */
    *p :- x;          /* store */
    all_store_sync;   /* wait globally */
Other sources of Overlap
• Would like the compiler to introduce put/get/store automatically
• Hardware also reorders
  – out-of-order execution
  – write buffers with read bypass
  – non-FIFO write buffers
  – weak memory models in general
• Software already reorders too
  – register allocation
  – any code motion
• System provides enforcement primitives
  – e.g., memory fence, volatile, etc.
  – tend to be heavyweight, with unpredictable performance
• Can the compiler hide all this?
Semantics: Sequential Consistency
• When compiling sequential programs, reordering

    x = expr1;           y = expr2;
    y = expr2;    into   x = expr1;

  is valid if y is not in expr1 and x is not in expr2 (roughly)
• When compiling parallel code, this is not a sufficient test:

    Initially flag = data = 0
    Proc A               Proc B
    data = 1;            while (flag != 1);
    flag = 1;            ... = ...data...;
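In Java terms, the flag/data handoff only works if the reorderings above are forbidden; a sketch using a volatile flag:

```java
// The flag/data handoff written in plain Java. Declaring flag volatile
// prevents the compiler and hardware from reordering the two writes (or
// the two reads), which is exactly the reordering that would break
// sequential consistency here.
public class Handoff {
    static int data = 0;
    static volatile boolean flag = false;

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> { data = 1; flag = true; });
        Thread b = new Thread(() -> {
            while (!flag) { /* spin until proc A sets the flag */ }
            System.out.println(data);  // guaranteed to print 1
        });
        a.start(); b.start();
        a.join(); b.join();
    }
}
```

Without volatile, the Java memory model permits thread b to see flag set but data still 0, or to spin forever.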
Cycle Detection: Dependence Analog
• Processors define a “program order” on accesses from the same thread
  – P is the union of these total orders
• The memory system defines an “access order” on accesses to the same variable
  – A is the access order (read/write and write/write pairs)
• A violation of sequential consistency is a cycle in P U A
• Intuition: time cannot flow backwards

[Diagram: write data -> write flag on Proc A, read flag -> read data on Proc B, with access-order edges linking the flag pair and the data pair into a cycle]
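A toy sketch (not the compiler's actual implementation) of looking for a cycle in P ∪ A for the flag/data example, with conflicting-access (A) edges inserted in both directions since their runtime order is unknown:

```java
import java.util.*;

// Program-order edges (P) plus bidirectional access-order edges (A) form
// a directed graph; a DFS with a recursion stack checks it for a cycle.
public class CycleDetect {
    public static boolean hasCycle(Map<String, List<String>> g) {
        Set<String> done = new HashSet<>(), stack = new HashSet<>();
        for (String v : g.keySet())
            if (dfs(v, g, done, stack)) return true;
        return false;
    }
    static boolean dfs(String v, Map<String, List<String>> g,
                       Set<String> done, Set<String> stack) {
        if (stack.contains(v)) return true;   // back edge: cycle found
        if (done.contains(v)) return false;
        stack.add(v);
        for (String w : g.getOrDefault(v, List.of()))
            if (dfs(w, g, done, stack)) return true;
        stack.remove(v);
        done.add(v);
        return false;
    }
    public static void main(String[] args) {
        Map<String, List<String>> g = new HashMap<>();
        g.put("writeData", List.of("writeFlag", "readData"));  // P edge + A edge
        g.put("writeFlag", List.of("readFlag"));               // A edge
        g.put("readFlag",  List.of("readData", "writeFlag"));  // P edge + A edge
        g.put("readData",  List.of("writeData"));              // A edge
        System.out.println(hasCycle(g));  // true: SC can be violated
    }
}
```

A cycle such as writeFlag -> readFlag -> writeFlag means some interleaving plus reordering lets "time flow backwards", so those accesses must not be reordered.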
Cycle Detection
• Generalizes to arbitrary numbers of variables and processors
• Cycles may be arbitrarily long, but it is sufficient to consider only cycles with 1 or 2 consecutive stops per processor [Shasha & Snir]

[Diagram: a cycle through write x, write y, read y on one processor and read y, write x on another]
Static Analysis for Cycle Detection
• Approximate P by the control flow graph
• Approximate A by undirected “dependence” edges
• Let the “delay set” D be all edges from P that are part of a minimal cycle
• The execution order of D edges must be preserved; other P edges may be reordered (modulo the usual rules about serial code)
• Synchronization analysis is also critical [Krishnamurthy]

[Diagram: accesses write z, read x; read y, write z; write y, read x spread across three processors]
Automatic Communication Optimization
• Implemented in a subset of C with limited pointers [Krishnamurthy, Yelick]
• Experiments on the NOW; 3 synchronization styles
• Future: pointer analysis and optimizations for AMR [Jeh, Yelick]

[Chart: normalized time for the three synchronization styles]
Other Language Extensions
Java extensions for expressiveness & performance
• Operator overloading
• Zone-based memory management
• Foreign function interface
The following is not yet implemented in the compiler
• Parameterized types (aka templates)
Implementation
• Strategy
  – compile Titanium into C
  – Solaris or POSIX threads for SMPs
  – Active Messages (Split-C library) for communication
• Status
  – runs on a Sun Enterprise 8-way SMP
  – runs on the Berkeley NOW
  – runs on the Tera (not fully tested)
  – T3E port partially working
  – SP2 port under way
Titanium Status
• Titanium language definition complete
• Titanium compiler running
• Compiles for uniprocessors, the NOW, Tera, T3E, SMPs, and the SP2 (port under way)
• Application development ongoing
• Lots of research opportunities
Future Directions
• Super-optimizers for targeted kernels
  – e.g., PHiPAC, Sparsity, FFTW, and ATLAS
  – include feedback and some runtime information
• New application domains
  – unstructured grids (aka graphs and sparse matrices)
  – I/O-intensive applications such as information retrieval
• Optimizing I/O as well as communication
  – uniform treatment of memory hierarchy optimizations
• Performance heterogeneity from the hardware
  – related to dynamic load balancing in software
• Reasoning about parallel code
  – correctness analysis: race condition and synchronization analysis
  – better analysis: aliases and threads
  – the Java memory model and hiding the hardware model
Backup Slides
Point, RectDomain, Arrays in General

    Point<2> lb = [1, 1];
    Point<2> ub = [10, 20];
    RectDomain<2> r = [lb : ub : [2, 2]];
    double [2d] A = new double[r];
    ...
    foreach (p in A.domain()) {
      A[p] = B[2 * p + [1, 1]];
    }

• Points are specified by a tuple of ints
• RectDomains are given by:
  – a lower bound point
  – an upper bound point
  – a stride point
• An array is given by a RectDomain and an element type
AMR Poisson
• Poisson solver [Semenzato, Pike, Colella]
  – 3D AMR
  – finite domain
  – variable coefficients
  – multigrid across levels
• Performance of the Titanium implementation
  – Sequential multigrid performance within +/- 20% of Fortran
  – On a fixed, well-balanced problem of 8 patches, each 72^3: parallel speedup of 5.5 on 8 processors
Distributed Data Structures
• Build distributed data structures with broadcast or exchange:

    RectDomain<1> single allProcs = [0 : Ti.numProcs() - 1];
    RectDomain<1> myFishDomain = [0 : myFishCount - 1];
    Fish [1d] single [1d] allFish = new Fish [allProcs][1d];
    Fish [1d] myFish = new Fish [myFishDomain];
    allFish.exchange(myFish);

• Now each processor has an array of global pointers, one to each processor’s chunk of fish
Consistency Model
• Titanium adopts the Java memory consistency model
• Roughly: accesses to shared variables that are not synchronized have undefined behavior
• Use synchronization to control access to shared variables
  – barriers
  – synchronized methods and blocks
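Synchronized methods work in Titanium as in Java; a minimal Java example of guarding a shared counter:

```java
// Java-style synchronized access to a shared counter: the object lock
// makes each increment atomic and orders it with respect to every other
// synchronized access to the same counter.
public class Counter {
    private int count = 0;
    public synchronized void inc() { count++; }
    public synchronized int get() { return count; }

    public static void main(String[] args) throws InterruptedException {
        Counter c = new Counter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < 4; i++) {
            ts[i] = new Thread(() -> { for (int j = 0; j < 1000; j++) c.inc(); });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.get());  // 4000: no increments are lost
    }
}
```

Without synchronized, two threads could read the same count value and both write back count+1, losing an increment.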
Example: Domain

    Point<2> lb = [0, 0];
    Point<2> ub = [6, 4];
    RectDomain<2> r = [lb : ub : [2, 2]];
    ...
    Domain<2> red = r + (r + [1, 1]);
    foreach (p in red) {
      ...
    }

[Diagram: r covers (0,0) to (6,4); r + [1,1] covers (1,1) to (7,5); their union red covers (0,0) to (7,5)]

• Domains in general are not rectangular
• Built using set operations
  – union, +
  – intersection, *
  – difference, -
• Example is a red-black algorithm
Example using Domains and foreach
• Gauss-Seidel red-black computation in multigrid:

    void gsrb() {
      boundary(phi);
      for (domain<2> d = red; d != null;
           d = (d == red ? black : null)) {
        foreach (q in d)  // unordered iteration
          res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)]) * 4
                    + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
                    - 20.0 * phi[q] - k * rhs[q]) * 0.05;
        foreach (q in d) phi[q] += res[q];
      }
    }
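For reference, a conventional red-black Gauss-Seidel sweep in plain Java, using a simpler 5-point Laplacian rather than the slide's 9-point stencil (helper names are ours):

```java
// Red-black Gauss-Seidel sweep: update all "red" points (i+j even), then
// all "black" points (i+j odd). Within one color no point depends on
// another, so each color's loop is safe to run in any order -- exactly
// what Titanium's foreach expresses.
public class RedBlackGS {
    static void sweep(double[][] phi, double[][] rhs, double h2) {
        int n = phi.length;
        for (int color = 0; color < 2; color++)  // red = 0, black = 1
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    if ((i + j) % 2 == color)
                        phi[i][j] = 0.25 * (phi[i - 1][j] + phi[i + 1][j]
                                          + phi[i][j - 1] + phi[i][j + 1]
                                          - h2 * rhs[i][j]);
    }
    public static void main(String[] args) {
        double[][] phi = new double[3][3];
        for (int i = 0; i < 3; i++) {  // fix the boundary at 1.0
            phi[i][0] = phi[i][2] = 1.0;
            phi[0][i] = phi[2][i] = 1.0;
        }
        sweep(phi, new double[3][3], 0.0);
        System.out.println(phi[1][1]);  // 1.0: interior relaxes to boundary value
    }
}
```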
Applications
• Three-D AMR Poisson Solver (AMR3D)
  – block-structured grids
  – 2000-line program
  – algorithm not yet fully implemented in other languages
  – tests performance and effectiveness of language features
• Other 2D Poisson solvers (under development)
  – infinite domains
  – based on the method of local corrections
• Three-D Electromagnetic Waves (EM3D)
  – unstructured grids
• Several smaller benchmarks