TRANSCRIPT
Parallelizing Iterative Computation for Multiprocessor Architectures
Peter Cappello
2
What is the problem?
Create programs for multiprocessor units (MPUs)
– Multicore processors
– Graphics processing units (GPU)
3
For whom is it a problem? Compiler designer
Application Program → Compiler → Executable
CPU
EASY
4
For whom is it a problem? Compiler designer
Application Program → Compiler → Executable
MPU
HARD
5
For whom is it a problem? Application programmer
Application Program → Compiler → Executable
MPU
6
Complex machine consequences
• Programmer needs to be highly skilled
• Programming is error-prone
These consequences imply . . .
Increased parallelism ⇒ increased development cost!
7
Amdahl’s Law
The speedup of a program is bounded by its inherently sequential part.
(http://en.wikipedia.org/wiki/Amdahl's_law)
If
– A program needs 20 hours using a single CPU
– 1 hour cannot be parallelized
Then
– Minimum execution time ≥ 1 hour
– Maximum speedup ≤ 20
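The bound on the slide follows from Amdahl's formula: with serial fraction f and p processors, speedup ≤ 1 / (f + (1 − f)/p). A minimal sketch checking the slide's numbers (20 hours total, 1 hour serial, so f = 1/20); the class and method names are mine, not from the slides.

```java
// Amdahl's law: speedup is bounded by the inherently serial fraction.
public class Amdahl {
    // Maximum speedup on p processors when serialFraction of the
    // work cannot be parallelized.
    static double maxSpeedup(double serialFraction, int p) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / p);
    }
}
```

As p grows without bound, the speedup approaches 1 / serialFraction — here, 20.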
8
[Figure: Amdahl's law speedup curves (http://en.wikipedia.org/wiki/Amdahl's_law)]
9
Parallelization opportunities
Scalable parallelism resides in 2 sequential program constructs:
• Divide-and-conquer recursion
• Iterative statements (for)
10
2 schools of thought
• Create a general solution
(Address everything somewhat well)
• Create a specific solution
(Address one thing very well)
11
Focus on iterative statements (for)
float[] x = new float[n];
float[] b = new float[n];
float[][] a = new float[n][n];
. . .
for ( int i = 0; i < n; i++ )
{
b[i] = 0;
for ( int j = 0; j < n; j++ )
b[i] += a[i][j]*x[j];
}
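The rows of b = Ax are independent, so the outer loop above is a natural parallelization target. A sketch using Java parallel streams (class and method names are mine, not from the slides):

```java
import java.util.stream.IntStream;

// Parallel matrix-vector product: each row i of b = Ax is computed
// by an independent task, so no synchronization is needed.
public class ParallelMatVec {
    static float[] multiply(float[][] a, float[] x) {
        int n = x.length;
        float[] b = new float[n];
        IntStream.range(0, n).parallel().forEach(i -> {
            float sum = 0;
            for (int j = 0; j < n; j++)
                sum += a[i][j] * x[j];
            b[i] = sum;   // each index i is written by exactly one task
        });
        return b;
    }
}
```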
12
Matrix-Vector Product
b = Ax, illustrated with a 3×3 matrix A.
_______________________________
b1 = a11*x1 + a12*x2 + a13*x3
b2 = a21*x1 + a22*x2 + a23*x3
b3 = a31*x1 + a32*x2 + a33*x3
13
[Figure: dependence graph of the 3×3 matrix-vector product — each bi accumulates ai1*x1 + ai2*x2 + ai3*x3 as the xj values flow through the array]
14
[Figure: the same dependence graph embedded in spacetime, with SPACE and TIME axes]
15
[Figure: an alternative spacetime embedding of the matrix-vector product]
16
[Figure: a third spacetime embedding of the matrix-vector product]
17
Matrix Product
C = AB, illustrated with 2×2 matrices.
c11 = a11*b11 + a12*b21
c12 = a11*b12 + a12*b22
c21 = a21*b11 + a22*b21
c22 = a21*b12 + a22*b22
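The four equations above are the n = 2 case of the recurrence c_ij = Σ_k a_ik * b_kj. A minimal sequential sketch for square matrices (names are mine, not from the slides):

```java
// C = A B for n×n matrices, directly from c_ij = sum_k a_ik * b_kj.
public class MatMul {
    static float[][] multiply(float[][] a, float[][] b) {
        int n = a.length;
        float[][] c = new float[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }
}
```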
18
[Figure: dependence graph of the 2×2 matrix product — b values flow along the k axis through each (row, col) pair]
19
[Figure: the matrix product embedded in spacetime (axes S = space, T = time)]
20
[Figure: an alternative spacetime embedding of the matrix product (axes S = space, T = time)]
21
Declaring an iterative computation
• Index set
• Data network
• Functions
• Space-time embedding
22
Declaring an Index set
I1: 1 ≤ i ≤ j ≤ n
I2: 1 ≤ i ≤ n, 1 ≤ j ≤ n
[Figure: the two index sets plotted in the (i, j) plane]
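The two index sets above can be enumerated directly — I1 is the upper-triangular set, I2 the full square. A sketch (class and method names are mine, not from the slides):

```java
import java.util.ArrayList;
import java.util.List;

// Enumerate the slide's two index sets as lists of (i, j) points.
public class IndexSets {
    // I1: { (i, j) : 1 <= i <= j <= n } — upper triangle.
    static List<int[]> i1(int n) {
        List<int[]> pts = new ArrayList<>();
        for (int i = 1; i <= n; i++)
            for (int j = i; j <= n; j++)   // j starts at i, so i <= j
                pts.add(new int[]{i, j});
        return pts;
    }

    // I2: { (i, j) : 1 <= i <= n, 1 <= j <= n } — full square.
    static List<int[]> i2(int n) {
        List<int[]> pts = new ArrayList<>();
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                pts.add(new int[]{i, j});
        return pts;
    }
}
```

For n = 3, I1 has 6 points and I2 has 9.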
23
Declaring a Data network
D1:
x: [ -1, 0];
b: [ 0, -1];
a: [ 0, 0];
D2:
x: [ -1, 0];
b: [ -1, -1];
a: [ 0, -1];
[Figure: data-flow directions of x, b, and a under D1 and D2]
24
Declaring an Index set + Data network
I1: 1 ≤ i ≤ j ≤ n
D1:
x: [ -1, 0];
b: [ 0, -1];
a: [ 0, 0];
[Figure: D1's data flow overlaid on index set I1 in the (i, j) plane]
25
Declaring the Functions
F1: float x’ (float x) { return x; }
    float b’ (float b, float x, float a)
    { return b + a*x; }
F2: char x’ (char x) { return x; }
    boolean b’ (boolean b, char x, char a)
    { return b && a == x; }
26
Declaring a Spacetime embedding
E1:
– space = -i + j
– time = i + j
E2:
– space1 = i
– space2 = j
– time = i + j
[Figure: E1 plotted on (space, time) axes; E2 on (space1, space2, time) axes]
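Each embedding is just an affine map from an index point (i, j) to processor coordinates plus a time step. A sketch (class and method names are mine, not from the slides):

```java
// The slide's two spacetime embeddings as plain functions.
public class Embeddings {
    // E1: project onto a 1-D processor array; returns {space, time}.
    static int[] e1(int i, int j) {
        return new int[]{-i + j, i + j};
    }

    // E2: keep both indices as space (a 2-D, pipelined array);
    // returns {space1, space2, time}.
    static int[] e2(int i, int j) {
        return new int[]{i, j, i + j};
    }
}
```

Note that both share time = i + j; only the space component differs, which is why the space and time embeddings can be chosen independently.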
27
Declaring an iterative computation Upper triangular matrix-vector product
UTMVP = (I1,D1,F1,E1)
time
space
28
Declaring an iterative computation Full matrix-vector product
FMVP = (I2,D1,F1,E1)
time
space
29
Declaring an iterative computation Convolution (polynomial product)
CONV = (I2,D2,F1,E1)
time
space
30
Declaring an iterative computation String pattern matching
SPM = (I2,D2,F2,E1)
time
space
31
Declaring an iterative computation Pipelined String pattern matching
PSPM = (I2,D2,F2,E2)
timespace2
space1
32
Iterative computation specification
Declarative specification
• Is a 4-dimensional design space
(actually 5-dimensional: the space embedding is
independent of the time embedding)
• Facilitates reuse of design components.
33
Starting with an existing language …
• Can infer
– Index set
– Data network
– Functions
• Cannot infer
– Space embedding
– Time embedding
34
Spacetime embedding
• Start with it as a program annotation
• More advanced:
the compiler optimizes the embedding based on a
program-annotated figure of merit.
35
Work
• Work out details of notation
• Implement in Java, C, Matlab, HDL, …
• Map virtual processor network to actual processor network
• Map
– Java: map processors to Threads, [links to Channels]
– GPU: map processors to GPU processing elements
(Challenge: spacetime embedding depends on underlying architecture)
36
Work …
• The output of one iterative computation is
the input to another.
• Develop a notation for specifying
composite iterative computations?
37
Thanks for listening!
Questions?