![Page 1: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/1.jpg)
Revisiting Loop Transformations with X10 Clocks
Tomofumi Yuki
Inria / LIP / ENS Lyon
X10 Workshop 2015
![Page 2: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/2.jpg)
The Problem: The Parallelism Challenge
- cannot escape going parallel
- parallel programming is hard
- automatic parallelization is limited

There won’t be any silver bullet. X10 as a partial answer:
- high-level language with parallelism in mind
- features to control parallelism/locality
![Page 3: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/3.jpg)
Programming with X10
Small set of parallel constructs:
- finish/async
- clocks
- at (places), atomic, when

Can be composed freely:
- interesting for both programmers and compilers
- also challenging

But it seems to be under-utilized.
![Page 4: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/4.jpg)
This Paper
Exploring how to use X10 clocks:
- performance: the “usual” way
- expressivity: the alternative
![Page 5: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/5.jpg)
Context: Loop Transformations
Key to exposing parallelism. Sometimes it’s easy:

    for i
      for j
        X[i] += ...

becomes

    for i
      forall j
        X[i] += ...

But not always:

    for i = 0 .. N
      for j = 1 .. M
        X[j] = foo(X[j-1], X[j+1]);

becomes

    for i = 1 .. 2N+M
      forall j = /*complex bounds*/
        X[j] = foo(X[2*j-i-1], X[2*j-i+1]);
![Page 6: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/6.jpg)
Automatic Parallelization
Very sensitive to inputs:

    for (i=1; i<N; i++)
      for (j=1; j<M; j++)
        x[i][j] = x[i-1][j] + x[i][j-1];

    for (i=1; i<N-1; i++)
      for (j=1; j<M-1; j++)
        y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1];

Automatically parallelized output:

    for (t1=2;t1<=3;t1++) {
      lbp=1;
      ubp=t1-1;
    #pragma omp parallel for private(lbv,ubv,t3)
      for (t2=lbp;t2<=ubp;t2++) {
        S1((t1-t2),t2);
      }
    }
    for (t1=4;t1<=min(M,N);t1++) {
      S1((t1-1),1);
      lbp=2;
      ubp=t1-2;
    #pragma omp parallel for private(lbv,ubv,t3)
      for (t2=lbp;t2<=ubp;t2++) {
        S1((t1-t2),t2);
        S2((t1-t2-1),(t2-1));
      }
      S1(1,(t1-1));
    }
    for (t1=M+1;t1<=N;t1++) {
      S1((t1-1),1);
      lbp=2;
      ubp=M-1;
    #pragma omp parallel for private(lbv,ubv,t3)
      for (t2=lbp;t2<=ubp;t2++) {
        S1((t1-t2),t2);
        S2((t1-t2-1),(t2-1));
      }
    }
    for (t1=N+1;t1<=M;t1++) {
      lbp=t1-N+1;
      ubp=t1-2;
    #pragma omp parallel for private(lbv,ubv,t3)
      for (t2=lbp;t2<=ubp;t2++) {
        S1((t1-t2),t2);
        S2((t1-t2-1),(t2-1));
      }
      S1(1,(t1-1));
    }
    }
    for (t1=max(M+1,N+1);t1<=N+M-2;t1++) {
      lbp=t1-N+1;
      ubp=M-1;
    #pragma omp parallel for private(lbv,ubv,t3)
      for (t2=lbp;t2<=ubp;t2++) {
        S1((t1-t2),t2);
        S2((t1-t2-1),(t2-1));
      }
    }

- very difficult to understand
- trust it or not use it
![Page 7: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/7.jpg)
Expressing with Clocks
Goal: retain the original structure.

Original loops:

    for (i=1; i<N; i++)
      for (j=1; j<M; j++)
        x[i][j] = x[i-1][j] + x[i][j-1];

    for (i=1; i<N-1; i++)
      for (j=1; j<M-1; j++)
        y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1];

With clocks:

    async
      for (i=1; i<N; i++)
        advance;
        async
          for (j=1; j<M; j++)
            x[i][j] = x[i-1][j] + x[i][j-1];
            advance;
    advance;
    async
      for (i=1; i<N-1; i++)
        advance;
        async
          for (j=1; j<M-1; j++)
            y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1];
            advance;
![Page 11: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/11.jpg)
Expressing with Clocks
1. make many iterations parallel
2. order them by synchronizations

Compound effect: parallelism similar to that obtained with loop transformations.
![Page 12: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/12.jpg)
Outline
- Introduction
- X10 Clocks
- Examples
- Discussion
![Page 15: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/15.jpg)
Clocks vs. Barriers
Barriers can easily deadlock:

    //P1          //P2
    barrier;      barrier;
    S0;           S1;
    barrier;

    => deadlock

Clocks are more dynamic:

    //P1          //P2
    advance;      advance;
    S0;           S1;
    advance;

    => OK
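The difference can be tried out in Java, whose `java.util.concurrent.Phaser` also supports dynamic registration. The sketch below is only an analogy, not X10: `arriveAndAwaitAdvance()` stands in for `advance`, `arriveAndDeregister()` stands in for the un-registration that happens when an activity terminates, and the class name `ClockVsBarrier` and the log strings are made up for illustration. With a fixed-party barrier, P1's second synchronization would wait forever for a partner that has already finished.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Phaser;

final class ClockVsBarrier {
    /** Runs P1 and P2 from the slide and returns the events in the order they were logged. */
    static List<String> run() {
        List<String> log = new CopyOnWriteArrayList<>();
        Phaser clock = new Phaser(2);              // both activities registered up front

        Thread p1 = new Thread(() -> {
            clock.arriveAndAwaitAdvance();         // advance;
            log.add("S0");
            clock.arriveAndAwaitAdvance();         // advance; completes once P2 has left the clock
            log.add("P1 done");
            clock.arriveAndDeregister();
        });
        Thread p2 = new Thread(() -> {
            clock.arriveAndAwaitAdvance();         // advance;
            log.add("S1");
            clock.arriveAndDeregister();           // terminating un-registers P2
        });

        p1.start();
        p2.start();
        join(p1);
        join(p2);
        return log;
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The relative order of `S0` and `S1` is not fixed, but `P1 done` is always logged last, because P1's second synchronization only completes after P2 has logged `S1` and deregistered.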
![Page 16: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/16.jpg)
Dynamicity of Clocks: Implicit Syntax
The process creating a clock is also registered.

    clocked finish
      for (i=1:N)
        clocked async {
          for (j=i:N)
            advance;
          S0;
        }

- clocked finish: creation of a clock
- clocked async: each process is registered
- advance: sync registered processes
- end of the async: each process is un-registered
![Page 20: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/20.jpg)
Dynamicity of Clocks: Implicit Syntax
Each process waits until all processes have started. The primary process has to terminate first.

    clocked finish
      for (i=1:N)
        clocked async {
          for (j=i:N)
            advance;
          S0;
        }

[Figure: timeline of activities 1 to 6]
![Page 23: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/23.jpg)
Dynamicity of Clocks: Implicit Syntax
The primary process calls advance each time, which gives a different synchronization pattern.

    clocked finish
      for (i=1:N) {
        clocked async {
          for (j=i:N)
            advance;
          S0;
        }
        advance;    // advance by primary activity
      }

[Figure: timeline of activities 1 to 6]
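The two patterns can be compared with a small Java sketch that records the phase in which each activity reaches `S0`. It is an analogy under assumptions, not X10: a `Phaser` plays the clock, one thread plays each activity, and the names `ClockPatterns` and `phasesOfS0` are hypothetical. The flag `primaryAdvances` switches between the version without and with the `advance` in the primary activity.

```java
import java.util.Arrays;
import java.util.concurrent.Phaser;

final class ClockPatterns {
    /** Returns, for activities 1..n, the clock phase in which each one executes S0. */
    static int[] phasesOfS0(int n, boolean primaryAdvances) {
        int[] phaseOfS0 = new int[n];
        Phaser clock = new Phaser(1);                  // the primary activity is registered
        Thread[] activities = new Thread[n];
        for (int i = 1; i <= n; i++) {
            final int id = i;
            clock.register();                          // clocked async: register the child
            activities[i - 1] = new Thread(() -> {
                for (int j = id; j <= n; j++) {
                    clock.arriveAndAwaitAdvance();     // advance;
                }
                phaseOfS0[id - 1] = clock.getPhase();  // S0 (here: just record the phase)
                clock.arriveAndDeregister();           // un-registered on termination
            });
            activities[i - 1].start();
            if (primaryAdvances) {
                clock.arriveAndAwaitAdvance();         // advance by primary activity
            }
        }
        clock.arriveAndDeregister();                   // end of clocked finish for the primary
        for (Thread t : activities) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return phaseOfS0;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(phasesOfS0(4, false)));
        System.out.println(Arrays.toString(phasesOfS0(4, true)));
    }
}
```

Without the primary `advance`, every activity is registered in phase 0 and none can pass its first synchronization until the primary has deregistered, so activity i reaches `S0` in phase n-i+1. With it, activity i is only registered in phase i-1, and all activities reach `S0` in phase n.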
![Page 25: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/25.jpg)
Example: Skewing
Skewing the loops is not easy.

    for (i=1:N)
      for (j=1:N)
        h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1])

[Figure: skewing]
![Page 27: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/27.jpg)
Example: Skewing
Skewing the loops is not easy: it changes the loop bounds and the indexing.

    for (i=2:2N)
      forall (j=max(1,i-N):min(N,i-1))
        h[i-j][j] = foo(h[(i-j)-1][j], h[(i-j)-1][j-1], h[(i-j)][j-1])

[Figure: skewing]
![Page 29: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/29.jpg)
Example: Skewing
Equivalent parallelism without changing loops:

    clocked finish
      for (i=1:N) {
        clocked async
          for (j=1:N) {
            h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1]);
            advance;    // locally sequential
          }
        advance;        // the launch of the entire block is deferred
      }
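A runnable approximation of this pattern in Java is sketched below, again using `java.util.concurrent.Phaser` as a stand-in for the X10 clock. Everything beyond the loop structure is an assumption made for illustration: the class name `ClockedWavefront`, the choice of `foo` as a plain sum, and the boundary values of 1 in row 0 and column 0. Row i is started one phase after row i-1, so element (i, j) is computed in phase (i-1)+(j-1), after the three elements it reads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Phaser;

final class ClockedWavefront {
    // Hypothetical stand-in for foo on the slide.
    static long foo(long up, long diag, long left) {
        return up + diag + left;
    }

    // Assumed boundary condition: row 0 and column 0 are all ones.
    static long[][] init(int n) {
        long[][] h = new long[n + 1][n + 1];
        for (int k = 0; k <= n; k++) {
            h[0][k] = 1;
            h[k][0] = 1;
        }
        return h;
    }

    /** Reference result: the original sequential loop nest. */
    static long[][] sequential(int n) {
        long[][] h = init(n);
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                h[i][j] = foo(h[i - 1][j], h[i - 1][j - 1], h[i][j - 1]);
        return h;
    }

    /** Same loop nest, one thread per row, ordered only by the clock. */
    static long[][] clocked(int n) {
        long[][] h = init(n);
        Phaser clock = new Phaser(1);                  // primary activity is registered
        List<Thread> rows = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            final int row = i;
            clock.register();                          // clocked async
            Thread t = new Thread(() -> {
                for (int j = 1; j <= n; j++) {
                    h[row][j] = foo(h[row - 1][j], h[row - 1][j - 1], h[row][j - 1]);
                    clock.arriveAndAwaitAdvance();     // advance; (locally sequential)
                }
                clock.arriveAndDeregister();           // row finished, leave the clock
            });
            rows.add(t);
            t.start();
            clock.arriveAndAwaitAdvance();             // advance; defers the next row by one phase
        }
        clock.arriveAndDeregister();                   // end of clocked finish for the primary
        for (Thread t : rows) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.deepEquals(sequential(8), clocked(8)));
    }
}
```

The loop bounds and array indices are identical in `sequential` and `clocked`; only the two synchronization calls and the thread creation differ, which is the point the slide makes about retaining the original structure.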
![Page 31: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/31.jpg)
Example: Skewing
You can have the same skewing.

    clocked finish
      for (j=1:N) {
        clocked async
          for (i=1:N) {
            h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1]);
            advance;
          }
        advance;
      }

[Figure: skewing]

Note: interchange; outer parallel loop with clocks.
![Page 33: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/33.jpg)
Example: Loop Fission
Common use of barriers.

    forall (i=1:N)
      S1;
      S2;

after fission:

    forall (i=1:N)
      S1;
    forall (i=1:N)
      S2;

With clocks:

    for (i=1:N)
      async {
        S1;
        S2;
      }

becomes

    for (i=1:N)
      async {
        S1;
        advance;
        S2;
      }
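The effect of the single `advance` can be checked with a small Java sketch, where a `Phaser` again stands in for the clock. This is an illustration under assumptions rather than X10 code: `ClockedFission`, `s1Done` and `seenByS2` are made-up names, S1 is modelled as incrementing a counter, and S2 as reading it.

```java
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicInteger;

final class ClockedFission {
    /** Returns, per activity, how many S1 instances had completed when its S2 ran. */
    static int[] run(int n) {
        AtomicInteger s1Done = new AtomicInteger();
        int[] seenByS2 = new int[n];
        Phaser clock = new Phaser(1);                  // primary activity is registered
        Thread[] activities = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int id = i;
            clock.register();                          // register the child before it starts
            activities[i] = new Thread(() -> {
                s1Done.incrementAndGet();              // S1
                clock.arriveAndAwaitAdvance();         // advance;
                seenByS2[id] = s1Done.get();           // S2
                clock.arriveAndDeregister();
            });
            activities[i].start();
        }
        clock.arriveAndDeregister();                   // primary leaves; phase 0 can now complete
        for (Thread t : activities) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seenByS2;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(4)));
    }
}
```

Every entry of the returned array equals `n`: no S2 starts before all S1 instances are done, which is the ordering the two separate `forall` loops give.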
![Page 34: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/34.jpg)
Example: Loop Fusion
Removes all the parallelism.

    for (i=1:N)
      S1;
    for (i=1:N)
      S2;

after fusion:

    for (i=1:N)
      S1;
      S2;

With clocks:

    async
      for (i=1:N)
        S1;
        advance;
        advance;
    advance;
    async
      for (i=1:N)
        S2;
        advance;
        advance;
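The interleaving produced by this pattern can be observed with the following Java sketch, which uses `java.util.concurrent.Phaser` in place of the X10 clock. The names `ClockedFusion`, `loop` and `trace` are hypothetical, and S1 and S2 are modelled as appending a label to a shared list. The first loop works in even phases and the second, started one phase later, in odd phases.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Phaser;

final class ClockedFusion {
    /** Builds one loop activity: run a statement, then advance twice, n times. */
    private static Thread loop(String name, int n, Phaser clock, List<String> trace) {
        return new Thread(() -> {
            for (int i = 1; i <= n; i++) {
                trace.add(name + "(" + i + ")");       // S1 or S2
                clock.arriveAndAwaitAdvance();         // advance;
                clock.arriveAndAwaitAdvance();         // advance;
            }
            clock.arriveAndDeregister();               // un-registered on termination
        });
    }

    /** Returns the order in which the statement instances executed. */
    static List<String> run(int n) {
        List<String> trace = new CopyOnWriteArrayList<>();
        Phaser clock = new Phaser(1);                  // primary activity is registered

        clock.register();                              // first async
        Thread first = loop("S1", n, clock, trace);
        first.start();

        clock.arriveAndAwaitAdvance();                 // advance; shifts the second loop by one phase

        clock.register();                              // second async
        Thread second = loop("S2", n, clock, trace);
        second.start();

        clock.arriveAndDeregister();                   // primary leaves the clock
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(run(3));
    }
}
```

For `n = 3` the trace is `S1(1), S2(1), S1(2), S2(2), S1(3), S2(3)`, the same order as the fused loop, while the two loops stay textually separate.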
![Page 35: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/35.jpg)
Example: Loop Fusion
Sometimes fusion is not too simple.

    for (i=1:N-1)
      S1(i);
    for (i=2:N)
      S2(i);

after fusion:

    S1(1);
    for (i=2:N-1)
      S1(i);
      S2(i);
    S2(N);

With clocks:

    async
      for (i=1:N-1)
        S1;
        advance;
        advance;
    advance;
    async
      for (i=2:N)
        S2;
        advance;
        advance;

- code structure stays
- the number of advances is the control
![Page 36: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/36.jpg)
What can be expressed?
Limiting factor: parallelism
- difficult to use for sequential loop nests
- works for wave-front parallelism

Intuition
- clocks defer execution
- deferring the parent activity has a cumulative effect
![Page 37: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/37.jpg)
Discussion
Learning curve
- behavior of clocks takes time to understand

How much can you express?
- 1D affine schedules for sure
- loop permutation is not possible
- what if we use multiple clocks?
![Page 38: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/38.jpg)
Potential Applications
It might be easier for some people
- have multiple ways to write code

Detect X10 fragments with such property
- convert to forall for performance
![Page 39: Revisiting Loop Transformations with X10 Clocks Tomofumi Yuki Inria / LIP / ENS Lyon X10 Workshop 2015](https://reader034.vdocument.in/reader034/viewer/2022051008/5a4d1b4b7f8b9ab0599a5891/html5/thumbnails/39.jpg)