Titanium, CS264, K. Yelick
Compiling for Parallel Machines
CS264
Kathy Yelick
Two General Research Goals
• Correctness: help programmers eliminate bugs
  – Analysis to detect bugs statically (and conservatively)
  – Tools such as debuggers to help detect bugs dynamically
• Performance: help make programs run faster
  – Static compiler optimizations
    » May use analyses similar to the above to ensure the compiler transforms code correctly
    » In many areas, the open problem is determining which transformations should be applied, and when
  – Link- or load-time optimizations, including object code translation
  – Feedback-directed optimization
  – Runtime optimization
• For parallel machines, if you can’t get good performance, what’s the point?
A Little History
• Most research on compiling for parallel machines has been on
  – automatic parallelization of serial code
  – loop-level parallelization (usually Fortran)
• Most parallel programs are written using explicit parallelism, either:
  A) Message passing with a single program, multiple data (SPMD) model, usually MPI with either Fortran or mixed C++ and Fortran, for scientific applications
  B) Shared memory with a thread and synchronization library in C or Java, for non-scientific applications
• Option B is easier to program, but requires hardware support that is still unproven for more than 200 processors
Titanium Overview
• Give programmers a global address space
  – Useful for building large, complex data structures that are spread over the machine
  – But don’t pretend it has uniform access time (i.e., not quite shared memory)
• Use an explicit parallelism model
  – SPMD for simplicity
• Extend a “standard” language with data structures for a specific problem domain: grid-based scientific applications
  – Small amount of syntax added for ease of programming
  – General idea: build domain-specific features into the language and optimization framework
Titanium Goals
• Performance
  – close to C/Fortran + MPI, or better
• Portability
  – develop on a uniprocessor, then an SMP, then an MPP/cluster
• Safety
  – as safe as Java, extended to the parallel framework
• Expressiveness
  – close to the usability of threads
  – add a minimal set of features
• Compatibility, interoperability, etc.
  – no gratuitous departures from the Java standard
Titanium
• Take the best features of threads and MPI
  – global address space like threads (for ease of programming)
  – SPMD parallelism like MPI (for performance)
  – local/global distinction, i.e., layout matters (for performance)
• Based on Java, a cleaner C++
  – classes, memory management
• Language is extensible through classes
  – domain-specific language extensions
  – current support for grid-based computations, including AMR
• Optimizing compiler
  – communication and memory optimizations
  – synchronization analysis
  – cache and other uniprocessor optimizations
New Language Features
• Scalable parallelism
  – SPMD model of execution with global address space
• Multidimensional arrays
  – points and index sets as first-class values to simplify programs
  – iterators for performance
• Checked synchronization
  – single-valued variables and globally executed methods
• Global communication library
• Immutable classes
  – user-definable non-reference types for performance
• Operator overloading
  – by demand from our user community
• Semi-automated zone-based memory management
  – as safe as a garbage-collected language
  – better parallel performance and scalability
Lecture Outline
• Language and compiler support for uniprocessor performance
  – Immutable classes
  – Multidimensional arrays
  – foreach
• Language support for parallel computation
• Analysis of parallel code
• Summary and future directions
Java: A Cleaner C++
• Java is an object-oriented language
  – classes (no standalone functions) with methods
  – inheritance between classes; multiple inheritance of interfaces only
• Documentation on the web at java.sun.com
• Syntax similar to C++:

    class Hello {
      public static void main (String [] argv) {
        System.out.println("Hello, world!");
      }
    }

• Safe
  – Strongly typed: checked at compile time, no unsafe casts
  – Automatic memory management
• Titanium is an (almost) strict superset
Java Objects
• Primitive scalar types: boolean, double, int, etc.
  – implementations will store these on the program stack
  – access is fast, comparable to other languages
• Objects: user-defined and from the standard library
  – passed by pointer value (object sharing) into functions
  – have an implicit level of indirection (pointer to the object)
  – simple model, but inefficient for small objects
[Diagram: primitive values (2.6, 3, true) stored directly on the stack; a Complex object (r: 7.1, i: 4.3) reached through a pointer]
Java Object Example

    class Complex {
      private double real;
      private double imag;
      public Complex(double r, double i) { real = r; imag = i; }
      public Complex add(Complex c) {
        return new Complex(c.real + real, c.imag + imag);
      }
      public double getReal() { return real; }
      public double getImag() { return imag; }
    }

    Complex c = new Complex(7.1, 4.3);
    c = c.add(c);

    class VisComplex extends Complex { ... }
Immutable Classes in Titanium
• For small objects, we would sometimes prefer
  – to avoid the level of indirection
  – pass by value (copying of the entire object)
  – especially when objects are immutable, i.e., fields are unchangeable
    » extends the idea of primitive values (1, 4.2, etc.) to user-defined values
• Titanium introduces immutable classes
  – all fields are final (implicitly)
  – cannot inherit from (extend) or be inherited by other classes
  – must have a 0-argument constructor, e.g., Complex()

    immutable class Complex { ... }
    Complex c = new Complex(7.1, 4.3);
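Titanium's `immutable` keyword is not legal Java, but its semantics can be approximated in plain Java with a final class and final fields; a minimal sketch (the class layout here is our illustration, not the Titanium runtime):

```java
// Plain-Java approximation of a Titanium immutable class: the class is
// final (cannot be extended), all fields are final, and a 0-argument
// constructor exists as Titanium requires.
final class Complex {
    public final double real;
    public final double imag;
    public Complex() { this(0.0, 0.0); }  // required 0-argument constructor
    public Complex(double r, double i) { real = r; imag = i; }
    public Complex add(Complex c) {
        return new Complex(real + c.real, imag + c.imag);
    }
}

public class ImmutableDemo {
    public static void main(String[] args) {
        Complex c = new Complex(7.1, 4.3);
        c = c.add(c);  // produces a new value; no field is ever mutated
        System.out.println(c.real + " " + c.imag);  // 14.2 8.6
    }
}
```

Java still allocates such objects on the heap, which is exactly the indirection cost that Titanium's true immutable (by-value) classes avoid.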
Arrays in Java
• Arrays in Java are objects
• Only 1D arrays are directly supported
• Array bounds are checked
• Multidimensional arrays as arrays-of-arrays are slow
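The cost is visible in plain Java: a double[][] is an array of independent row objects, so rows may be ragged, aliased, and non-contiguous, and each element access pays two indirections plus two bounds checks. A small illustration:

```java
// Java's "2D" arrays are arrays of row pointers: each row is a separate
// heap object, so nothing guarantees rows are contiguous, equal-length,
// or even distinct.
public class ArraysDemo {
    public static void main(String[] args) {
        double[][] a = new double[3][];
        a[0] = new double[5];
        a[1] = new double[2];  // rows can have different lengths
        a[2] = a[0];           // ...or alias each other
        System.out.println(a[1].length);  // 2
    }
}
```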
Multidimensional Arrays in Titanium
• New kind of multidimensional array added
  – Two arrays may overlap (unlike Java arrays)
  – Indexed by Points (tuples of ints)
  – Constructed over a set of Points, called a Domain
  – RectDomains are a special case of Domains
  – Points, Domains, and RectDomains are built-in immutable classes
• Support for adaptive meshes and other mesh/grid operations

    RectDomain<2> d = [0:n, 0:n];
    Point<2> p = [1, 2];
    double [2d] a = new double [d];
    a[0,0] = a[9,9];
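Points and RectDomains are Titanium built-ins; as a rough illustration only, a plain-Java mock of a 2-D rectangular domain (unit stride; all names here are hypothetical, not the Titanium runtime) might look like:

```java
import java.util.*;

// Hypothetical plain-Java mock of Titanium's Point<2> / RectDomain<2>:
// a RectDomain is just the set of integer points between its bounds.
public class RectDomainDemo {
    record Point2(int x, int y) {}
    record RectDomain2(Point2 lb, Point2 ub) {
        List<Point2> points() {  // enumerate [lb.x:ub.x, lb.y:ub.y]
            List<Point2> ps = new ArrayList<>();
            for (int i = lb.x(); i <= ub.x(); i++)
                for (int j = lb.y(); j <= ub.y(); j++)
                    ps.add(new Point2(i, j));
            return ps;
        }
    }
    public static void main(String[] args) {
        RectDomain2 d = new RectDomain2(new Point2(0, 0), new Point2(1, 2));
        System.out.println(d.points().size());  // 6 points in a 2x3 domain
    }
}
```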
Naïve MatMul with Titanium Arrays

    public static void matMul(double [2d] a,
                              double [2d] b,
                              double [2d] c) {
      int n = c.domain().max()[1];  // assumes square
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
          for (int k = 0; k < n; k++) {
            c[i,j] += a[i,k] * b[k,j];
          }
        }
      }
    }
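For comparison, the same naïve triple loop over ordinary Java arrays (square matrices assumed, as in the slide):

```java
// Plain-Java version of the naive triple loop above, for comparison
// with the Titanium code. Assumes square n-by-n matrices.
public class MatMul {
    static void matMul(double[][] a, double[][] b, double[][] c) {
        int n = c.length;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }
    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = new double[2][2];
        matMul(a, b, c);
        System.out.println(c[0][0] + " " + c[1][1]);  // 19.0 50.0
    }
}
```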
Two Performance Issues
• In any language, uniprocessor performance is often dominated by memory hierarchy costs
  – algorithms that are “blocked” for the memory hierarchy (caches and registers) can be much faster
• In Titanium, the representation of arrays is fast, but the access methods are expensive
  – need optimizations on Titanium arrays
    » common subexpression elimination
    » eliminate (or hoist) bounds checking
    » strength reduction: e.g., naïve code has one divide per dimension for each array access
• See Geoff Pike’s work
  – goal: competitive with C/Fortran performance, or better
Matrix Multiply (blocked, or tiled)
Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize.

    for i = 1 to N
      for j = 1 to N
        {read block C(i,j) into fast memory}
        for k = 1 to N
          {read block A(i,k) into fast memory}
          {read block B(k,j) into fast memory}
          C(i,j) = C(i,j) + A(i,k) * B(k,j)  {do a matrix multiply on blocks}
        {write block C(i,j) back to slow memory}

[Diagram: C(i,j) = C(i,j) + A(i,k) * B(k,j), one block of each matrix]
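The tiled pseudocode above can be sketched in plain Java; the blocksize parameter bs is a tuning knob, and the helper name is ours, not from the slides:

```java
// Sketch of the blocked (tiled) multiply: iterate over bs-by-bs blocks
// so each block of A, B, and C stays in cache while it is being reused.
public class BlockedMatMul {
    static void matMulBlocked(double[][] a, double[][] b, double[][] c, int bs) {
        int n = c.length;
        for (int i0 = 0; i0 < n; i0 += bs)
            for (int j0 = 0; j0 < n; j0 += bs)
                for (int k0 = 0; k0 < n; k0 += bs)
                    // multiply the (i0,k0) block of A by the (k0,j0) block of B
                    for (int i = i0; i < Math.min(i0 + bs, n); i++)
                        for (int j = j0; j < Math.min(j0 + bs, n); j++) {
                            double s = c[i][j];
                            for (int k = k0; k < Math.min(k0 + bs, n); k++)
                                s += a[i][k] * b[k][j];
                            c[i][j] = s;
                        }
    }
    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = new double[2][2];
        matMulBlocked(a, b, c, 1);
        System.out.println(c[0][0] + " " + c[1][1]);  // 19.0 50.0
    }
}
```

Any blocksize gives the same result; only the traversal order (and hence cache behavior) changes.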
Memory Hierarchy Optimizations: MatMul
[Chart: speed of n-by-n matrix multiply on a Sun Ultra-1/170; peak = 330 MFlops]
Unordered iteration
• Often useful to reorder iterations for caches
• Compilers can do this for simple operations, e.g., matrix multiply, but it is hard in general
• Titanium adds unordered iteration over rectangular domains:

    foreach (p within r) { ... }

  – p is a new Point, scoped only within the foreach body
  – r is a previously declared RectDomain
• foreach simplifies bounds checking as well
• Additional operations on domains and arrays to subset and transform
Better MatMul with Titanium Arrays

    public static void matMul(double [2d] a,
                              double [2d] b,
                              double [2d] c) {
      foreach (ij within c.domain()) {
        double [1d] aRowi = a.slice(1, ij[1]);
        double [1d] bColj = b.slice(2, ij[2]);
        foreach (k within aRowi.domain()) {
          c[ij] += aRowi[k] * bColj[k];
        }
      }
    }

The current compiler eliminates array overhead, making performance comparable to C for three nested loops; automatic tiling is still TBD.
Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for parallel computation
  – SPMD execution
  – Global and local references
  – Communication
  – Barriers and single
  – Synchronized methods and blocks (as in Java)
• Analysis of parallel code
• Summary and future directions
SPMD Execution Model
• Java programs can be run as Titanium programs, but the result will be that all processors do all the work
• E.g., parallel hello world:

    class HelloWorld {
      public static void main (String [] argv) {
        System.out.println("Hello from proc " + Ti.thisProc());
      }
    }

• Any non-trivial program will have communication and synchronization between processors
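The SPMD style can be mimicked in plain Java threads, with a thread id standing in for Ti.thisProc() (a rough analogy, not the Titanium runtime; the processor count is an assumption for the demo):

```java
// SPMD "hello" in plain Java threads: every thread runs the same body,
// distinguished only by its processor id.
public class SpmdHello {
    static final int P = 4;  // number of "processors" (assumed for this demo)
    static String greeting(int id) { return "Hello from proc " + id; }
    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[P];
        for (int p = 0; p < P; p++) {
            final int id = p;
            ts[p] = new Thread(() -> System.out.println(greeting(id)));
            ts[p].start();
        }
        for (Thread t : ts) t.join();  // wait for all "processors" to finish
    }
}
```

The output lines appear in arbitrary order, which is the point: SPMD threads execute the same code but not in lock-step.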
SPMD Execution Model
• A common style is compute/communicate
• E.g., in each timestep within a fish simulation with gravitational attraction:

    read all fish and compute forces on mine
    Ti.barrier();
    write to my fish using new forces
    Ti.barrier();
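The compute/communicate timestep above maps naturally onto java.util.concurrent's CyclicBarrier when sketched in plain Java (the "force" arithmetic here is a made-up toy, not the course's fish simulation):

```java
import java.util.concurrent.CyclicBarrier;

// Compute/communicate in plain Java: each "processor" reads everyone's
// state, waits at a barrier, then writes only its own state, and waits
// again before the next timestep.
public class ComputeCommunicate {
    static double[] simulate(int steps) throws Exception {
        int procs = 2;
        double[] force = new double[procs];
        double[] pos = {1.0, 1.0};
        CyclicBarrier barrier = new CyclicBarrier(procs);
        Thread[] ts = new Thread[procs];
        for (int p = 0; p < procs; p++) {
            final int me = p;
            ts[p] = new Thread(() -> {
                try {
                    for (int s = 0; s < steps; s++) {
                        force[me] = pos[0] + pos[1];  // read all, compute mine
                        barrier.await();              // like Ti.barrier()
                        pos[me] += force[me];         // write only my state
                        barrier.await();              // like Ti.barrier()
                    }
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            ts[p].start();
        }
        for (Thread t : ts) t.join();
        return pos;
    }
    public static void main(String[] args) throws Exception {
        double[] pos = simulate(3);
        System.out.println(pos[0] + " " + pos[1]);  // 27.0 27.0
    }
}
```

The barriers separate the read phase from the write phase, so no thread reads a position while another is updating it.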
SPMD Model
• All processors start together and execute the same code, but not in lock-step
• Sometimes they take different branches:

    if (Ti.thisProc() == 0) { ... do setup ... }
    for (all data I own) { ... compute on data ... }

• A common source of bugs is barriers or other global operations inside branches or loops
  – barrier, broadcast, reduction, exchange
• A “single” method is one called by all procs:
    public single static void allStep(...)
• A “single” variable has the same value on all procs:
    int single timestep = 0;
SPMD Execution Model
• Barriers and single in FishSimulation (n-body):

    class FishSim {
      public static void main (String [] argv) {  // single
        int allTimestep = 0;                      // single
        int allEndTime = 100;                     // single
        for (; allTimestep < allEndTime; allTimestep++) {
          read all fish and compute forces on mine
          Ti.barrier();
          write to my fish using new forces
          Ti.barrier();
        }
      }
    }

• Single methods are inferred; see David Gay’s work
Global Address Space
• Processes allocate locally
• References can be passed to other processes

    class C { int val; ... }
    C gv;        // global pointer
    C local lv;  // local pointer
    if (thisProc() == 0) {
      lv = new C();
    }
    gv = broadcast lv from 0;
    gv.val = ...;   // full
    ... = gv.val;   // functionality

[Diagram: each process has a local heap; after the broadcast, gv on every process points to the object allocated in process 0’s heap, while only process 0’s lv refers to it locally]
Use of Global / Local
• Default is global
  – easier to port shared-memory programs
  – performance bugs are common: global pointers are more expensive
  – harder to use sequential kernels
• Use local declarations in critical sections
• Compiler can infer many instances of “local”
• See Liblit’s work on LQI (Local Qualification Inference)
Local Pointer Analysis [Liblit, Aiken]
• Global references simplify programming, but incur overhead even when data is local
  – Split-C therefore requires global pointers to be declared explicitly
  – Titanium pointers are global by default: easier, better portability
• Automatic “local qualification” inference

[Chart: effect of LQI on running time (sec) for the cannon, lu, sample, gsrb, and poisson applications, original vs. after LQI]
Parallel Performance
• Speedup on an UltraSPARC SMP
• AMR largely limited by
  – the current algorithm
  – problem size
  – 2 levels, with the top one serial
• Not yet optimized with “local” for distributed memory

[Chart: speedup of em3d and amr on 1, 2, 4, and 8 processors]
Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for parallel computation
• Analysis and optimization of parallel code
  – Tolerating network latency: the Split-C experience
  – Hardware trends and reordering
  – Semantics: sequential consistency
  – Cycle detection: parallel dependence analysis
  – Synchronization analysis: parallel flow analysis
• Summary and future directions
Split-C Experience: Latency Overlap
• Titanium borrowed ideas from Split-C
  – global address space
  – SPMD parallelism
• But Split-C had non-blocking accesses built in, to tolerate network latency on remote read/write
• Also one-way communication
• Conclusion: useful, but complicated

    int *global p;
    x := *p;          /* get */
    *p := 3;          /* put */
    sync;             /* wait for my puts/gets */
    *p :- x;          /* store */
    all_store_sync;   /* wait globally */
Other sources of Overlap
• Would like the compiler to introduce put/get/store automatically
• Hardware also reorders
  – out-of-order execution
  – write buffers with read bypass
  – non-FIFO write buffers
  – weak memory models in general
• Software already reorders too
  – register allocation
  – any code motion
• System provides enforcement primitives
  – e.g., memory fence, volatile, etc.
  – tend to be heavyweight, with unpredictable performance
• Can the compiler hide all this?
Semantics: Sequential Consistency
• When compiling sequential programs, reordering

    x = expr1;           y = expr2;
    y = expr2;    into   x = expr1;

  is valid if y is not in expr1 and x is not in expr2 (roughly)
• When compiling parallel code, this is not a sufficient test:

    Initially flag = data = 0
    Proc A               Proc B
    data = 1;            while (flag != 1);
    flag = 1;            ... = ...data...;
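In Java terms, the flag/data handoff only works if the reorderings above are forbidden; a sketch using a volatile flag:

```java
// The flag/data handoff written in plain Java. Declaring flag volatile
// prevents the compiler and hardware from reordering the two writes (or
// the two reads), which is exactly the reordering that would break
// sequential consistency here.
public class Handoff {
    static int data = 0;
    static volatile boolean flag = false;

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> { data = 1; flag = true; });
        Thread b = new Thread(() -> {
            while (!flag) { /* spin until proc A sets the flag */ }
            System.out.println(data);  // guaranteed to print 1
        });
        a.start(); b.start();
        a.join(); b.join();
    }
}
```

Without volatile, the Java memory model permits thread b to see flag set but data still 0, or to spin forever.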
Cycle Detection: Dependence Analog
• Processors define a “program order” on accesses from the same thread
  – P is the union of these total orders
• The memory system defines an “access order” on accesses to the same variable
  – A is the access order (read/write and write/write pairs)
• A violation of sequential consistency is a cycle in P U A
• Intuition: time cannot flow backwards

[Diagram: write data -> write flag on Proc A, read flag -> read data on Proc B, with access-order edges linking the flag pair and the data pair into a cycle]
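A toy sketch (not the compiler's actual implementation) of looking for a cycle in P ∪ A for the flag/data example, with conflicting-access (A) edges inserted in both directions since their runtime order is unknown:

```java
import java.util.*;

// Program-order edges (P) plus bidirectional access-order edges (A) form
// a directed graph; a DFS with a recursion stack checks it for a cycle.
public class CycleDetect {
    public static boolean hasCycle(Map<String, List<String>> g) {
        Set<String> done = new HashSet<>(), stack = new HashSet<>();
        for (String v : g.keySet())
            if (dfs(v, g, done, stack)) return true;
        return false;
    }
    static boolean dfs(String v, Map<String, List<String>> g,
                       Set<String> done, Set<String> stack) {
        if (stack.contains(v)) return true;   // back edge: cycle found
        if (done.contains(v)) return false;
        stack.add(v);
        for (String w : g.getOrDefault(v, List.of()))
            if (dfs(w, g, done, stack)) return true;
        stack.remove(v);
        done.add(v);
        return false;
    }
    public static void main(String[] args) {
        Map<String, List<String>> g = new HashMap<>();
        g.put("writeData", List.of("writeFlag", "readData"));  // P edge + A edge
        g.put("writeFlag", List.of("readFlag"));               // A edge
        g.put("readFlag",  List.of("readData", "writeFlag"));  // P edge + A edge
        g.put("readData",  List.of("writeData"));              // A edge
        System.out.println(hasCycle(g));  // true: SC can be violated
    }
}
```

A cycle such as writeFlag -> readFlag -> writeFlag means some interleaving plus reordering lets "time flow backwards", so those accesses must not be reordered.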
Cycle Detection
• Generalizes to arbitrary numbers of variables and processors
• Cycles may be arbitrarily long, but it is sufficient to consider only cycles with 1 or 2 consecutive stops per processor [Shasha & Snir]

[Diagram: a cycle through write x, write y, read y on one processor and read y, write x on another]
Static Analysis for Cycle Detection
• Approximate P by the control flow graph
• Approximate A by undirected “dependence” edges
• Let the “delay set” D be all edges from P that are part of a minimal cycle
• The execution order of D edges must be preserved; other P edges may be reordered (modulo the usual rules about serial code)
• Synchronization analysis is also critical [Krishnamurthy]

[Diagram: accesses write z, read x; read y, write z; write y, read x spread across three processors]
Automatic Communication Optimization
• Implemented in a subset of C with limited pointers [Krishnamurthy, Yelick]
• Experiments on the NOW; 3 synchronization styles
• Future: pointer analysis and optimizations for AMR [Jeh, Yelick]

[Chart: normalized time for the three synchronization styles]
Other Language Extensions
Java extensions for expressiveness & performance
• Operator overloading
• Zone-based memory management
• Foreign function interface
The following is not yet implemented in the compiler
• Parameterized types (aka templates)
Implementation
• Strategy
  – compile Titanium into C
  – Solaris or POSIX threads for SMPs
  – Active Messages (Split-C library) for communication
• Status
  – runs on a Sun Enterprise 8-way SMP
  – runs on the Berkeley NOW
  – runs on the Tera (not fully tested)
  – T3E port partially working
  – SP2 port under way
Titanium Status
• Titanium language definition complete
• Titanium compiler running
• Compiles for uniprocessors, the NOW, Tera, T3E, SMPs, and the SP2 (port under way)
• Application development ongoing
• Lots of research opportunities
Future Directions
• Super-optimizers for targeted kernels
  – e.g., PHiPAC, Sparsity, FFTW, and ATLAS
  – include feedback and some runtime information
• New application domains
  – unstructured grids (aka graphs and sparse matrices)
  – I/O-intensive applications such as information retrieval
• Optimizing I/O as well as communication
  – uniform treatment of memory hierarchy optimizations
• Performance heterogeneity from the hardware
  – related to dynamic load balancing in software
• Reasoning about parallel code
  – correctness analysis: race condition and synchronization analysis
  – better analysis: aliases and threads
  – the Java memory model and hiding the hardware model
Backup Slides
Point, RectDomain, Arrays in General

    Point<2> lb = [1, 1];
    Point<2> ub = [10, 20];
    RectDomain<2> r = [lb : ub : [2, 2]];
    double [2d] A = new double[r];
    ...
    foreach (p in A.domain()) {
      A[p] = B[2 * p + [1, 1]];
    }

• Points are specified by a tuple of ints
• RectDomains are given by:
  – a lower bound point
  – an upper bound point
  – a stride point
• An array is given by a RectDomain and an element type
AMR Poisson
• Poisson solver [Semenzato, Pike, Colella]
  – 3D AMR
  – finite domain
  – variable coefficients
  – multigrid across levels
• Performance of the Titanium implementation
  – Sequential multigrid performance within +/- 20% of Fortran
  – On a fixed, well-balanced problem of 8 patches, each 72^3: parallel speedup of 5.5 on 8 processors
Distributed Data Structures
• Build distributed data structures with broadcast or exchange:

    RectDomain<1> single allProcs = [0 : Ti.numProcs() - 1];
    RectDomain<1> myFishDomain = [0 : myFishCount - 1];
    Fish [1d] single [1d] allFish = new Fish [allProcs][1d];
    Fish [1d] myFish = new Fish [myFishDomain];
    allFish.exchange(myFish);

• Now each processor has an array of global pointers, one to each processor’s chunk of fish
Consistency Model
• Titanium adopts the Java memory consistency model
• Roughly: accesses to shared variables that are not synchronized have undefined behavior
• Use synchronization to control access to shared variables
  – barriers
  – synchronized methods and blocks
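Synchronized methods work in Titanium as in Java; a minimal Java example of guarding a shared counter:

```java
// Java-style synchronized access to a shared counter: the object lock
// makes each increment atomic and orders it with respect to every other
// synchronized access to the same counter.
public class Counter {
    private int count = 0;
    public synchronized void inc() { count++; }
    public synchronized int get() { return count; }

    public static void main(String[] args) throws InterruptedException {
        Counter c = new Counter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < 4; i++) {
            ts[i] = new Thread(() -> { for (int j = 0; j < 1000; j++) c.inc(); });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.get());  // 4000: no increments are lost
    }
}
```

Without synchronized, two threads could read the same count value and both write back count+1, losing an increment.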
Example: Domain

    Point<2> lb = [0, 0];
    Point<2> ub = [6, 4];
    RectDomain<2> r = [lb : ub : [2, 2]];
    ...
    Domain<2> red = r + (r + [1, 1]);
    foreach (p in red) {
      ...
    }

[Diagram: r covers (0,0) to (6,4); r + [1,1] covers (1,1) to (7,5); their union red covers (0,0) to (7,5)]

• Domains in general are not rectangular
• Built using set operations
  – union, +
  – intersection, *
  – difference, -
• Example is a red-black algorithm
Example using Domains and foreach
• Gauss-Seidel red-black computation in multigrid:

    void gsrb() {
      boundary(phi);
      for (domain<2> d = red; d != null;
           d = (d == red ? black : null)) {
        foreach (q in d)  // unordered iteration
          res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)]) * 4
                    + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
                    - 20.0 * phi[q] - k * rhs[q]) * 0.05;
        foreach (q in d) phi[q] += res[q];
      }
    }
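For reference, a conventional red-black Gauss-Seidel sweep in plain Java, using a simpler 5-point Laplacian rather than the slide's 9-point stencil (helper names are ours):

```java
// Red-black Gauss-Seidel sweep: update all "red" points (i+j even), then
// all "black" points (i+j odd). Within one color no point depends on
// another, so each color's loop is safe to run in any order -- exactly
// what Titanium's foreach expresses.
public class RedBlackGS {
    static void sweep(double[][] phi, double[][] rhs, double h2) {
        int n = phi.length;
        for (int color = 0; color < 2; color++)  // red = 0, black = 1
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    if ((i + j) % 2 == color)
                        phi[i][j] = 0.25 * (phi[i - 1][j] + phi[i + 1][j]
                                          + phi[i][j - 1] + phi[i][j + 1]
                                          - h2 * rhs[i][j]);
    }
    public static void main(String[] args) {
        double[][] phi = new double[3][3];
        for (int i = 0; i < 3; i++) {  // fix the boundary at 1.0
            phi[i][0] = phi[i][2] = 1.0;
            phi[0][i] = phi[2][i] = 1.0;
        }
        sweep(phi, new double[3][3], 0.0);
        System.out.println(phi[1][1]);  // 1.0: interior relaxes to boundary value
    }
}
```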
Applications
• Three-D AMR Poisson Solver (AMR3D)
  – block-structured grids
  – 2000-line program
  – algorithm not yet fully implemented in other languages
  – tests performance and effectiveness of language features
• Other 2D Poisson solvers (under development)
  – infinite domains
  – based on the method of local corrections
• Three-D Electromagnetic Waves (EM3D)
  – unstructured grids
• Several smaller benchmarks