Automatic task generation for DAGuE
George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault, Jack Dongarra
http://icl.utk.edu/dague
The DAGuE system
Serial Code to Dataflow Representation
DAGuE Compiler
Example: QR Factorization
Input Format – Quark (PLASMA)
    for (k = 0; k < MT; k++) {
        Insert_Task( zgeqrt, A[k][k], INOUT,
                             T[k][k], OUTPUT );
        for (m = k+1; m < MT; m++) {
            Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D | REGION_U,
                                 A[m][k], INOUT | LOCALITY,
                                 T[m][k], OUTPUT );
        }
        for (n = k+1; n < NT; n++) {
            Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                                 T[k][k], INPUT,
                                 A[k][n], INOUT );
            for (m = k+1; m < MT; m++) {
                Insert_Task( ztsmqr, A[k][n], INOUT,
                                     A[m][n], INOUT | LOCALITY,
                                     A[m][k], INPUT,
                                     T[m][k], INPUT );
            }
        }
    }
• Sequential C code
• Annotated through QUARK-specific syntax
  • Insert_Task
  • INOUT, OUTPUT, INPUT
  • REGION_L, REGION_U, REGION_D, …
  • LOCALITY
• StarPU syntax in progress
DAGuE Compiler analysis steps
• Record all USEs
• Record all DEFinitions
• Formulate as Omega Relations:
  • all true (flow) dependencies
  • all output dependencies
• Compute the differences
• Formulate all anti-dependencies
• Finalize synchronization edges
Traditional Compiler (Control Flow)
[Figure: Control Flow Graph of the nested loops for(k…), for(m…), for(n…), for(m…)]
Data Flow imposes ordering
Dataflow Analysis
• Example on task DGEQRT of QR
• Polyhedral Analysis through Omega
• Compute algebraic expressions for:
  • Source and destination tasks
  • Necessary conditions for that data flow to exist
    for k = 0 .. N-1
        A[k][k], T[k][k] <- GEQRT( A[k][k] )
        for m = k+1 .. N-1
            A[k][k] | U, A[m][k], T[m][k] <- TSQRT( A[k][k] | U, A[m][k], T[m][k] )
        for n = k+1 .. N-1
            A[k][n] <- UNMQR( A[k][k] | L, T[k][k], A[k][n] )
            for m = k+1 .. N-1
                A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
[Figure: Incoming Data and Outgoing Data of GEQRT. Labels: MEM, k=0, k=SIZE-1, n=k+1, m=k+1, U, L]
Intermediate Representation: Job Data Flow
Control flow is eliminated, therefore maximum parallelism is possible
    GEQRT(k)
      /* Execution space */
      k = 0..( ( MT < NT ) ? MT-1 : NT-1 )
      /* Locality */
      : A(k, k)
      RW    A <- (k == 0) ? A(k, k) : A1 TSMQR(k-1, k, k)
              -> (k < NT-1) ? A UNMQR(k, k+1 .. NT-1)    [type = LOWER]
              -> (k < MT-1) ? A1 TSQRT(k, k+1)           [type = UPPER]
              -> (k == MT-1) ? A(k, k)                   [type = UPPER]
      WRITE T <- T(k, k)
              -> T(k, k)
              -> (k < NT-1) ? T UNMQR(k, k+1 .. NT-1)
      /* Priority */
      ; (NT-k)*(NT-k)*(NT-k)
    BODY
      zgeqrt( A, T )
    END
Dataflow Analysis
DEF  A[m][n] : k+1 <= m <= N-1, k+1 <= n <= N-1, 0 <= k <= N-1
USE  A[k'][k'] : 0 <= k' <= N-1
Ctrl Flow: k' > k
Flow Dependency (RAW) Relation
[k,m,n] -> [k'] : k+1 <= m <= N-1, k+1 <= n <= N-1, 0 <= k <= N-1, 0 <= k' <= N-1, k < k', m = k', n = k'
Dataflow Analysis
Omega Simplified: {[k,m,m] -> [m] : k+1, 0 <= m < N}
Output Dependency (WAW), Omega Simplified: {[k,m,m] -> [m] : k+1, 0 <= m < N}
Dataflow Analysis
Real Edge: Flow - Output
{[k,k+1,k+1] -> [k+1] : 0 <= k <= N-2}   (n = k+1, m = k+1)
GEQRT's incoming edge
{[k-1,k,k] -> [k] : 0 < k <= N-1}
(k>0) ? TSMQR(k-1,k,k)
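One way to sanity-check the incoming edge is to replay the serial loop nest and remember, for every tile, which task wrote it last. The sketch below does this in plain C for a small square tile matrix. The names `geqrt_incoming_edge`, `struct qr_task` and the `QR_MAX_TILES` cap are illustrative assumptions, not part of DAGuE, which derives the edge symbolically through Omega rather than by enumeration.

```c
#define QR_MAX_TILES 16   /* hypothetical cap, only to keep the sketch simple */

enum qr_task_class { QR_NONE = 0, QR_GEQRT, QR_TSQRT, QR_UNMQR, QR_TSMQR };

struct qr_task {
    enum qr_task_class cls;
    int k, m, n;          /* indices a task class does not use are set to -1 */
};

static struct qr_task qr_make(enum qr_task_class cls, int k, int m, int n)
{
    struct qr_task t = { cls, k, m, n };
    return t;
}

/*
 * Replay the serial QR pseudocode on an N x N tile matrix and return the
 * task that last wrote A[kp][kp] before GEQRT(kp) starts. A result with
 * cls == QR_NONE means the tile still holds its initial value from memory.
 */
struct qr_task geqrt_incoming_edge(int N, int kp)
{
    struct qr_task last[QR_MAX_TILES][QR_MAX_TILES];
    struct qr_task none = qr_make(QR_NONE, -1, -1, -1);

    if (N < 1 || N > QR_MAX_TILES || kp < 0 || kp >= N)
        return none;

    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            last[r][c] = none;

    for (int k = 0; k < N; k++) {
        if (k == kp)                       /* GEQRT(kp) is about to run */
            return last[kp][kp];
        last[k][k] = qr_make(QR_GEQRT, k, -1, -1);
        for (int m = k + 1; m < N; m++) {  /* TSQRT writes A[k][k], A[m][k] */
            last[k][k] = qr_make(QR_TSQRT, k, m, -1);
            last[m][k] = qr_make(QR_TSQRT, k, m, -1);
        }
        for (int n = k + 1; n < N; n++) {  /* UNMQR writes A[k][n] */
            last[k][n] = qr_make(QR_UNMQR, k, -1, n);
            for (int m = k + 1; m < N; m++) {  /* TSMQR writes A[k][n], A[m][n] */
                last[k][n] = qr_make(QR_TSMQR, k, m, n);
                last[m][n] = qr_make(QR_TSMQR, k, m, n);
            }
        }
    }
    return none;
}
```

For N = 4, calling `geqrt_incoming_edge(4, 2)` reports TSMQR(1, 2, 2), the instance named by the guard `(k>0) ? TSMQR(k-1,k,k)`, while `geqrt_incoming_edge(4, 0)` reports no earlier writer, so the tile comes from memory.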
Dataflow Analysis
GEQRT -> UNMQR
{[k] -> [k, n] : 0 <= k <= N-2 && k+1 <= n <= N-1}
-> (k<N-1) ? UNMQR(k, k+1..N-1)
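The relation and the guarded range are two views of the same edge: the relation says which (GEQRT, UNMQR) pairs are connected, and the guard plus range says how a GEQRT instance enumerates its consumers. A minimal sketch in C, assuming a square N x N tile matrix and using hypothetical function names, compares the two by brute force:

```c
/* Membership test for {[k] -> [k2, n] : k2 == k && 0 <= k <= N-2 && k+1 <= n <= N-1}. */
int geqrt_feeds_unmqr(int N, int k, int k2, int n)
{
    return k2 == k && 0 <= k && k <= N - 2 && k + 1 <= n && n <= N - 1;
}

/* Count the consumers of GEQRT(k) by testing every UNMQR(k2, n) instance. */
int geqrt_unmqr_fanout_bruteforce(int N, int k)
{
    int count = 0;
    for (int k2 = 0; k2 < N; k2++)
        for (int n = k2 + 1; n < N; n++)
            count += geqrt_feeds_unmqr(N, k, k2, n);
    return count;
}

/* Count the consumers from the guarded range "(k < N-1) ? UNMQR(k, k+1..N-1)". */
int geqrt_unmqr_fanout_guarded(int N, int k)
{
    return (k < N - 1) ? (N - 1) - (k + 1) + 1 : 0;
}
```

Both counts agree for every k in 0..N-1; the last GEQRT (k = N-1) has no UNMQR consumer, which is what the guard `k < N-1` expresses.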
Anti-dependencies
• In theory, anti-deps do not matter in distributed memory
• But real machines are distributed/shared memory hybrids
• Anti-deps must create synchronization edges
• Overestimating anti-deps is safe (albeit slow)
• Output deps should be treated the same … in theory
Anti-dependencies in QR?
TSMQR -> GEQRT
{[k, m, n] -> [k'] : n = m = k'}
n = k+1, m = k+1
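[Figure: task graph of the four QR task classes; edges are labeled with the relations listed below]
Task execution spaces:
    GEQRT(k)      k = 0..((mt < nt) ? mt-1 : nt-1)
    UNMQR(k,n)    k = 0..((mt < nt) ? mt-1 : nt-1)   n = k+1..nt-1
    TSQRT(k,m)    k = 0..((mt < nt) ? mt-1 : nt-1)   m = k+1..mt-1
    TSMQR(k,m,n)  k = 0..((mt < nt) ? mt-1 : nt-1)   m = k+1..mt-1   n = k+1..nt-1
Edge relations:
    GEQRT -> TSQRT:  {[k] -> [k,k+1] : mt >= (k+2)}
    GEQRT -> UNMQR:  {[k] -> [k,n] : k < n < nt && k < nt-1}
    UNMQR -> TSMQR:  {[k,n] -> [k,k+1,n] : k < mt-1}
    TSQRT -> TSMQR:  {[k,m] -> [k,m,n] : k < nt-1 && k < n < nt}
    TSQRT -> TSQRT:  {[k,m] -> [k,m+1] : m < mt-1}
    TSMQR -> TSQRT:  {[k,m,n] -> [n,m] : n == (k+1) && m > n}
    TSMQR -> GEQRT:  {[k,m,n] -> [n] : k+1 == n && k+1 == m}
    TSMQR -> UNMQR:  {[k,m,n] -> [k+1,n] : m == k+1 && n > m}
    TSMQR -> TSMQR:  {[k,m,n] -> [k+1,m,n] : n > 1+k && m > 1+k}
    TSMQR -> TSMQR:  {[k,m,n] -> [k,m+1,n] : m < mt-1}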
anti-dep: {[k,m,n] -> [k'] : n = m = k'}
Finalizing Anti-deps
Transitive Closure is undecidable!
IsDescendantOf()
Current/Future Work
Can we address non-affine codes?
Example: Reduction Operation
• Reduction: apply a user-defined operator on each data element and store the result in a single location. (Suppose the operator is associative and commutative.)
Issue: Non-affine loops lead to non-polyhedral array accesses
    for (s = 1; s < N/2; s = 2*s)
        for (i = 0; i < N-s; i += 2*s)
            V[i] = op(V[i], V[i+s]);
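To make the access pattern concrete, here is a minimal serial sketch of such a pairwise reduction in C. The names `tree_reduce` and `op_add` are illustrative assumptions. Note one deliberate difference from the loop on the slide: the outer loop here runs while `s < N`, which is the bound needed for the full result to end up in `V[0]`; with the slide's `s < N/2` bound the last combining levels would not be executed.

```c
/* Hypothetical operator; the slides only require it to be associative and commutative. */
long op_add(long a, long b)
{
    return a + b;
}

/*
 * Pairwise (tree) reduction over V[0..N-1]. The stride s doubles at every
 * level, so the subscripts i and i+s are not affine functions of a loop
 * counter that advances by one. The result is left in V[0] and returned.
 * Assumption: the outer bound is s < N (not s < N/2 as on the slide).
 */
long tree_reduce(long *V, int N, long (*op)(long, long))
{
    for (int s = 1; s < N; s *= 2)
        for (int i = 0; i + s < N; i += 2 * s)
            V[i] = op(V[i], V[i + s]);
    return V[0];
}
```

Calling `tree_reduce` on the eight values 1..8 with `op_add` combines neighbours at distance 1, then 2, then 4, and returns 36; the array is modified in place.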
Example: Reduction Operation
    reduce(l, p)
      l = 1 .. depth+1
      p = 0 .. (MT / (1<<l))
      : V(p)
      RW   A <- (1 == l) ? V(2*p) : A reduce( l-1, 2*p )
             -> ((depth+1) == l) ? V(0)
             -> (0 == (p%2)) ? A reduce(l+1, p/2) : B reduce(l+1, p/2)
      READ B <- ((p*(1<<l) + (1<<(l-1))) > MT) ? V(0)
             <- (1 == l) ? V(2*p+1)
             <- (1 != l) ? A reduce( l-1, p*2+1 )
    BODY
      operator(A, B);
    END
[Figure: reduction tree, levels 0 to 3]
Current Solution: Hand-writing of the data dependency using the intermediate Data Flow representation
Handling Reduction
    for (k = 0; k < NT; k++) {
        for (i = 1; i < NT/2; i *= 2) {
            for (j = 0; j < NT-i; j += 2*i) {
                Task_Rdc: A[k][j] += A[k][j+i];
            }
        }
    }
Loop Canonicalization
    for (k = 0; k < NT; k++) {
        for (ii = log(1); ii < log(NT/2); ii++) {
            i = 2**ii
            for (jj = 0; jj < (NT-i)/(2*i); jj++) {
                j = jj*2*i;
                Task_Rdc: A[k][j] += A[k][j+i];
            }
        }
    }
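The canonicalization can be checked by recording the (i, j) pairs each loop nest visits and comparing them. The sketch below is a hedged illustration in C with hypothetical names (`rdc_pairs_original`, `rdc_pairs_canonical`, `rdc_canonicalization_matches`). It makes two assumptions beyond the slide: the logarithm is base 2, and the trip count `(NT-i)/(2*i)` is read as a real-valued bound, so the integer version uses a ceiling division.

```c
#define RDC_MAX_PAIRS 256   /* hypothetical capacity for the recorded pairs */

/* Record the (i, j) pairs of the original, non-affine inner loops. */
int rdc_pairs_original(int NT, int out[][2])
{
    int count = 0;
    for (int i = 1; i < NT / 2; i *= 2)
        for (int j = 0; j < NT - i; j += 2 * i) {
            if (count < RDC_MAX_PAIRS) { out[count][0] = i; out[count][1] = j; }
            count++;
        }
    return count;
}

/* Record the (i, j) pairs of the canonical form: unit-stride ii and jj. */
int rdc_pairs_canonical(int NT, int out[][2])
{
    int count = 0;
    int levels = 0;

    /* Number of ii values: the smallest L with 2^L >= NT/2. */
    while ((1 << levels) < NT / 2)
        levels++;

    for (int ii = 0; ii < levels; ii++) {
        int i = 1 << ii;                             /* i = 2**ii */
        int trips = (NT - i + 2 * i - 1) / (2 * i);  /* ceil((NT-i)/(2*i)) */
        for (int jj = 0; jj < trips; jj++) {
            int j = jj * 2 * i;
            if (count < RDC_MAX_PAIRS) { out[count][0] = i; out[count][1] = j; }
            count++;
        }
    }
    return count;
}

/* Return 1 when both loop nests visit the same pairs in the same order. */
int rdc_canonicalization_matches(int NT)
{
    int a[RDC_MAX_PAIRS][2], b[RDC_MAX_PAIRS][2];
    int na = rdc_pairs_original(NT, a);
    int nb = rdc_pairs_canonical(NT, b);

    if (na != nb || na > RDC_MAX_PAIRS)
        return 0;
    for (int p = 0; p < na; p++)
        if (a[p][0] != b[p][0] || a[p][1] != b[p][1])
            return 0;
    return 1;
}
```

With a floor division instead of the ceiling, the canonical loops would skip the last j of a level whenever `NT-i` is not a multiple of `2*i` (for example NT = 8, i = 1), which is why the rounding direction matters once the bound is evaluated in integer arithmetic.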
Handling Reduction
    for (k = 0; k < NT; k++) {
        for (ii = log(1); ii < log(NT/2); ii++) {
            i = 2**ii
            for (jj = 0; jj < (NT-i)/(2*i); jj++) {
                j = jj*2*i;
                Task_Rdc: A[k][j] = A[k][j] + A[k][j+i];
            }
        }
    }
For an edge to exist it must be:
    j = j'+i'  =>  jj' = (jj * 2**ii) / (2**ii') - 1/2
  OR
    j = j'     =>  jj' = (jj * 2**ii) / (2**ii')
Hard to solve statically.
But a given (source) task has fixed {ii, jj}, so jj * 2**ii is a constant C:
    jj' = C / (2**ii') - 1/2
    jj' = C / (2**ii')
Finding a destination task means finding integers that satisfy either equation.
Run-time upper bound for cost: log(NT/2)
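The run-time search can be sketched as a loop over the candidate levels ii' that tests whether either equation yields an integer jj'. The C function below is an illustration under stated assumptions, not the DAGuE implementation: `rdc_find_consumer` is a hypothetical name, `levels` (the number of ii values) is supplied by the caller, and the test is done on j = 2*C so that everything stays in integer arithmetic.

```c
/*
 * Given the source task (ii, jj) of the canonicalized reduction, find the
 * first later level ii' > ii whose task reads the element A[k][j] that the
 * source wrote. On success returns 1 and fills:
 *   *ii_dst, *jj_dst : indices of the destination task
 *   *operand         : 0 if read as A[k][j'], 1 if read as A[k][j'+i']
 * Returns 0 when no level below `levels` consumes the element.
 * The loop runs at most (levels - ii - 1) times.
 */
int rdc_find_consumer(int NT, int levels, int ii, int jj,
                      int *ii_dst, int *jj_dst, int *operand)
{
    int j = jj * 2 * (1 << ii);          /* j = 2*C with C = jj * 2**ii */

    for (int iip = ii + 1; iip < levels; iip++) {
        int ip = 1 << iip;               /* i' = 2**ii' */
        int stride = 2 * ip;             /* distance between j' values */

        /* j = j': jj' = C / 2**ii' must be an integer inside the loop bounds. */
        if (j % stride == 0 && j < NT - ip) {
            *ii_dst = iip; *jj_dst = j / stride; *operand = 0;
            return 1;
        }
        /* j = j' + i': jj' = C / 2**ii' - 1/2 must be an integer. */
        if (j >= ip && (j - ip) % stride == 0) {
            *ii_dst = iip; *jj_dst = (j - ip) / stride; *operand = 1;
            return 1;
        }
    }
    return 0;
}
```

For NT = 8 and three levels, the task (ii = 0, jj = 1), which writes A[k][2], is consumed by (ii' = 1, jj' = 0) as its second operand, and a task on the last level has no consumer. The number of loop iterations is bounded by the number of levels, in line with the logarithmic run-time bound stated above.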
Handling “Arbitrary” Code?
    for (i = a(); i < b(); i = c(i)) {
        A[d(i)] <- Task( A[e(i)] );
    }
Loop Canonicalization
    for (ii = LB; ii < UB; ii++) {
        i = f(ii);
        A[d(i)] <- Task( A[e(i)] );
    }
d(i) = e(i') : f(LB) < i < i' < f(UB)
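When d, e and f are opaque functions, the condition d(i) = e(i') can still be evaluated at run time once the source iteration is fixed. The following C sketch shows one possible shape of such a search. All names (`first_consumer`, `index_fn`, `ex_f`, `ex_d`, `ex_e`) are hypothetical, and the sketch ignores intervening writers, which a complete flow-dependency test would also have to rule out.

```c
typedef int (*index_fn)(int);

/*
 * For the task at canonical iteration ii_src, return the first later
 * iteration ii' in (ii_src, UB) whose task reads the element the source
 * wrote, i.e. e(f(ii')) == d(f(ii_src)). Returns -1 if there is none.
 * Assumption: intervening writes to the same element are not checked.
 */
int first_consumer(int ii_src, int UB, index_fn f, index_fn d, index_fn e)
{
    int written = d(f(ii_src));

    for (int iip = ii_src + 1; iip < UB; iip++)
        if (e(f(iip)) == written)
            return iip;
    return -1;
}

/* Hypothetical example: i doubles each iteration, task i writes A[i] and reads A[i/2]. */
int ex_f(int ii) { return 1 << ii; }
int ex_d(int i)  { return i; }
int ex_e(int i)  { return i / 2; }
```

With these example functions, `first_consumer(0, 5, ex_f, ex_d, ex_e)` returns 1, because iteration 1 (i = 2) reads A[1], which iteration 0 wrote; the last iteration has no consumer and yields -1.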
Conclusions & Future Work
• Static compiler analysis makes coding in DAGuE easier
• DAGuE Compiler handles affine codes w/ Quark notation (StarPU wip)
• Anti-dependencies are also handled (60% of the time it works all the time)
• Anti-deps finalization algorithm can be used for isDescendantOf()
• Non-affine codes can be future input for the compiler
  • Regular Control Flow
  • Loops that can be canonicalized
• Evaluating shortcut answers quickly will be the biggest challenge
Parallel machines are hard to program and we should make them even harder - to keep the riff-raff off them. -- Gary Montry
There are 3 rules to follow when parallelizing large codes. Unfortunately, no one knows what these rules are. -- W. Somerset Maugham, Gary Montry