cilk - a forerunning language for parallel...
Post on 18-Aug-2020
9 Views
Preview:
TRANSCRIPT
CilkA Forerunning Language for Parallel Programming
Olivier Aumage – STORM TeamInria – LaBRI
Technological Development’s Tuesdays
ST RMStatic Optimizations – Runtime Methods
Olivier Aumage – STORM Team – Cilk 2
1Introduction
Introduction
The two “all-time” goals in parallel programming
Programming parallel applications– Easily
Running parallel applications– Efficiently
The Cilk language and framework played an anticipative rolein reaching these goals for some classes of applications
Olivier Aumage – STORM Team – Cilk – 1. Introduction 3
Cilk
Cilk in a Few Words
A programming environment– A language and compiler: keyword-based extension of C– An execution model and a run-time system
Developed at the MIT– Supertech Research Group– Charles E. Leiserson’s team– Mid-90’s
Olivier Aumage – STORM Team – Cilk – 1. Introduction 4
History
Academic Era — Cilk– 1994: Cilk 1– 1998: The Implementation of the Cilk-5 Multithreaded Language paper
by Matteo Frigo, Charles E. Leiserson, and Keith H. Randall, at PLDI’98
Start-up Era — Cilk++– 2006: Launch of “Cilk Arts” company– 2008: Cilk++ version 1.0
Intel Era — Cilk Plus– 2009: Intel acquires Cilk Arts– 2010: Intel Cilk Plus released as part of the Intel C++ Compiler– 2012: Release of the Cilk Plus support for the GNU GCC Compiler,
implemented by Intel
Olivier Aumage – STORM Team – Cilk – 1. Introduction 5
Context
Middle of nineties
HardwareSMP: Symmetric Multi-ProcessorsNeed for parallel programming models
SoftwareNotion of threads: concurrent processing contexts within single processHow to efficiently/easily express application parallelism using threads?
Program Easily?Parallel program quickly derived from sequential programConcurrency expressed safely (correctness, consistency)
Run Efficiently?No over/under-subscriptionLoad-balancingLow overhead
Olivier Aumage – STORM Team – Cilk – 1. Introduction 6
Context
Middle of nineties
HardwareSMP: Symmetric Multi-ProcessorsNeed for parallel programming models
SoftwareNotion of threads: concurrent processing contexts within single processHow to efficiently/easily express application parallelism using threads?
Program Easily?Parallel program quickly derived from sequential programConcurrency expressed safely (correctness, consistency)
Run Efficiently?No over/under-subscriptionLoad-balancingLow overhead
Olivier Aumage – STORM Team – Cilk – 1. Introduction 6
Context
Middle of nineties
HardwareSMP: Symmetric Multi-ProcessorsNeed for parallel programming models
SoftwareNotion of threads: concurrent processing contexts within single processHow to efficiently/easily express application parallelism using threads?
Program Easily?Parallel program quickly derived from sequential programConcurrency expressed safely (correctness, consistency)
Run Efficiently?No over/under-subscriptionLoad-balancingLow overhead
Olivier Aumage – STORM Team – Cilk – 1. Introduction 6
Context
Middle of nineties
HardwareSMP: Symmetric Multi-ProcessorsNeed for parallel programming models
SoftwareNotion of threads: concurrent processing contexts within single processHow to efficiently/easily express application parallelism using threads?
Program Easily?Parallel program quickly derived from sequential programConcurrency expressed safely (correctness, consistency)
Run Efficiently?No over/under-subscriptionLoad-balancingLow overhead
Olivier Aumage – STORM Team – Cilk – 1. Introduction 6
Context
Nowadays
Hardware
SMP: Symmetric Multi-Processors
Multi-core processors
Hardware multi-threading (SMT, Hyperthreading, etc.)
Even greater need for parallel programming models!
Revived interest for Cilk-like approaches
Olivier Aumage – STORM Team – Cilk – 1. Introduction 7
Context
Nowadays
Hardware
SMP: Symmetric Multi-Processors
Multi-core processors
Hardware multi-threading (SMT, Hyperthreading, etc.)
Even greater need for parallel programming models!
Revived interest for Cilk-like approaches
Olivier Aumage – STORM Team – Cilk – 1. Introduction 7
Olivier Aumage – STORM Team – Cilk 8
2Concept and Ideas
Concept and Ideas
C extension
Task Parallelism paradigm
Work Stealing paradigm
THE Algorithm
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 9
C Extension
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 10
Cilk is a faithful extension of C
Keyword based language– Per opposition to pragma based
languages
Main keywords (original Cilk):– cilk: declaration of a potentially
parallel routine– spawn: launch of a potentially
parallel routine– sync: wait for completion of
launched routines– inlet : special function to
aggregate results (reduction)A faithful extension of C
– The C elision of a Cilk program isa valid C program
C Extension
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 10
Cilk is a faithful extension of C
Keyword based language– Per opposition to pragma based
languagesMain keywords (original Cilk):
– cilk: declaration of a potentiallyparallel routine
– spawn: launch of a potentiallyparallel routine
– sync: wait for completion oflaunched routines
– inlet : special function toaggregate results (reduction)
A faithful extension of C– The C elision of a Cilk program is
a valid C program
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
C Extension
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 10
Cilk is a faithful extension of C
Keyword based language– Per opposition to pragma based
languagesMain keywords (original Cilk):
– cilk: declaration of a potentiallyparallel routine
– spawn: launch of a potentiallyparallel routine
– sync: wait for completion oflaunched routines
– inlet : special function toaggregate results (reduction)
A faithful extension of C– The C elision of a Cilk program is
a valid C program
1
2
3 i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = f i b ( n−1) ;9 y = f i b ( n−2) ;
10
11 return x+y ;12 }13 }
C Extension
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 10
Cilk is a faithful extension of C
Keyword based language– Per opposition to pragma based
languagesMain keywords (original Cilk):
– cilk: declaration of a potentiallyparallel routine
– spawn: launch of a potentiallyparallel routine
– sync: wait for completion oflaunched routines
– inlet : special function toaggregate results (reduction)
A faithful extension of C– The C elision of a Cilk program is
a valid C program
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Task Parallelism Paradigm
Potential Parallelism
The programmer declares what can be run in parallel
Cilk’s runtime decides what to run in parallel
Work / Worker DecouplingWorkers: threads launched by Cilk
Work: tasks expressed by the Application
RationalizationGeneric worker threads instead of special purpose threads
Externalized common parallelism managementExternalized resource allocation policy
– Potential parallelism vs number of concurrent contexts
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 11
Task Parallelism Paradigm
Potential Parallelism
The programmer declares what can be run in parallel
Cilk’s runtime decides what to run in parallel
Work / Worker DecouplingWorkers: threads launched by Cilk
Work: tasks expressed by the Application
RationalizationGeneric worker threads instead of special purpose threads
Externalized common parallelism managementExternalized resource allocation policy
– Potential parallelism vs number of concurrent contexts
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 11
Task Parallelism Paradigm
Potential Parallelism
The programmer declares what can be run in parallel
Cilk’s runtime decides what to run in parallel
Work / Worker DecouplingWorkers: threads launched by Cilk
Work: tasks expressed by the Application
RationalizationGeneric worker threads instead of special purpose threads
Externalized common parallelism managementExternalized resource allocation policy
– Potential parallelism vs number of concurrent contexts
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 11
Tasks
Notion of frameState of the current cilk function being executed
Live local variables, function arguments
“Program Counter” (PC)
A Frame is a Task
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 12
Tasks
Notion of frameState of the current cilk function being executed
Live local variables, function arguments
“Program Counter” (PC)
A Frame is a Task
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 12
Task Lists
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 13
Notion of frame dequeOne task list per workerImplemented as a “deque”(doubly-ended queue of frames)
– Head H– Tail T
Task Lists
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 13
Notion of frame dequeOne task list per workerImplemented as a “deque”(doubly-ended queue of frames)
– Head H– Tail T
H = T
Worker
8
7
6
5
4
3
1
10
2
9
Task Lists
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 13
Notion of frame dequeOne task list per workerImplemented as a “deque”(doubly-ended queue of frames)
– Head H– Tail T
A worker thread pushes/pops framesat the Tail side of its deque
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Task Lists
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 13
Notion of frame dequeOne task list per workerImplemented as a “deque”(doubly-ended queue of frames)
– Head H– Tail T
A worker thread pushes/pops framesat the Tail side of its deque
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Task Lists
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 13
Notion of frame dequeOne task list per workerImplemented as a “deque”(doubly-ended queue of frames)
– Head H– Tail T
A worker thread pushes/pops framesat the Tail side of its deque
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Task Lists
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 13
Notion of frame dequeOne task list per workerImplemented as a “deque”(doubly-ended queue of frames)
– Head H– Tail T
A worker thread pushes/pops framesat the Tail side of its deque
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Task Spawn
Upon a spawn, the worker. . .. . . suspends the parent function
. . . saves the state of the parent function in its frame
. . . pushes the parent frame on its task list
. . . before calling the spawned child function
When the child function completes and returns, the worker. . .. . . pops the parent frame
. . . restores the state of the parent function from its frame
. . . resumes the parent function
This is more or less what regular functions do. . .
Where is the parallelism?
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 14
Task Spawn
Upon a spawn, the worker. . .. . . suspends the parent function
. . . saves the state of the parent function in its frame
. . . pushes the parent frame on its task list
. . . before calling the spawned child function
When the child function completes and returns, the worker. . .. . . pops the parent frame
. . . restores the state of the parent function from its frame
. . . resumes the parent function
This is more or less what regular functions do. . .
Where is the parallelism?
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 14
Task Spawn
Upon a spawn, the worker. . .. . . suspends the parent function
. . . saves the state of the parent function in its frame
. . . pushes the parent frame on its task list
. . . before calling the spawned child function
When the child function completes and returns, the worker. . .. . . pops the parent frame
. . . restores the state of the parent function from its frame
. . . resumes the parent function
This is more or less what regular functions do. . .
Where is the parallelism?
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 14
Task Spawn
Upon a spawn, the worker. . .. . . suspends the parent function
. . . saves the state of the parent function in its frame
. . . pushes the parent frame on its task list
. . . before calling the spawned child function
When the child function completes and returns, the worker. . .. . . pops the parent frame
. . . restores the state of the parent function from its frame
. . . resumes the parent function
This is more or less what regular functions do. . .
Where is the parallelism?
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 14
Example: Deque Management on Spawn
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 15
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Example: Deque Management on Spawn
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 15
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
H = T
Worker
8
7
6
5
4
3
1
10
2
9
Example: Deque Management on Spawn
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 15
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Fib n, step 2
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Example: Deque Management on Spawn
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 15
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Fib n−1, step 2
Fib n, step 2
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Example: Deque Management on Spawn
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 15
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Fib n−2, step 2
Fib n−1, step 2
Fib n, step 2
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Example: Deque Management on Spawn
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 15
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
. . .
Fib n−2, step 2
Fib n−1, step 2
Fib n, step 2
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Example: Deque Management on Spawn
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 15
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Fib 2, step 2
. . .
Fib n−2, step 2
Fib n−1, step 2
Fib n, step 2
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Example: Deque Management on Spawn
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 15
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
. . .
Fib n−2, step 2
Fib n−1, step 2
Fib n, step 2
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Example: Deque Management on Spawn
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 15
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Fib 2, step 3
. . .
Fib n−2, step 2
Fib n−1, step 2
Fib n, step 2
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Parallelism
Work Stealing paradigmIdle workers steal work. . .
. . . from other worker’s queues
Work stolen as frame/taskA thief resumes a suspended parent task. . .
. . . while a victim runs its child
Load balancing: Idle workers steal from busy workers
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 16
Parallelism
Work Stealing paradigmIdle workers steal work. . .
. . . from other worker’s queues
Work stolen as frame/taskA thief resumes a suspended parent task. . .
. . . while a victim runs its child
Load balancing: Idle workers steal from busy workers
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 16
Parallelism
Work Stealing paradigmIdle workers steal work. . .
. . . from other worker’s queues
Work stolen as frame/taskA thief resumes a suspended parent task. . .
. . . while a victim runs its child
Load balancing: Idle workers steal from busy workers
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 16
Task Spawn [UPDATED]
Upon a spawn, the worker. . .. . . suspends the parent function
. . . saves the state of the parent function in its frame
. . . pushes the parent frame on its task list
. . . before calling the spawned child function
When the child function completes and returns, the worker. . .. . . attempts to pop the parent frameif it succeeds, it. . .
– . . . restores the state of the parent function from its frame– . . . resumes the parent function
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 17
Task Spawn [UPDATED]
Upon a spawn, the worker. . .. . . suspends the parent function
. . . saves the state of the parent function in its frame
. . . pushes the parent frame on its task list
. . . before calling the spawned child function
When the child function completes and returns, the worker. . .. . . attempts to pop the parent frameif it succeeds, it. . .
– . . . restores the state of the parent function from its frame– . . . resumes the parent function
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 17
Work Stealing Implementation
Task lists implementation. . .Doubly-ended queue
Head H
Tail T
. . . with the following rulesWorkers push/pop work at the Tail side T of their own deque
An idle worker (thief) steals work at the Head side H of another worker(victim) deque
T >= H under normal conditions
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 18
Work Stealing Implementation
Task lists implementation. . .Doubly-ended queue
Head H
Tail T
. . . with the following rulesWorkers push/pop work at the Tail side T of their own deque
An idle worker (thief) steals work at the Head side H of another worker(victim) deque
T >= H under normal conditions
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 18
Example: Work Stealing
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 19
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Fib 2, step 3
. . .
Fib n−2, step 2
Fib n−1, step 2
Fib n, step 2
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Example: Work Stealing
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 19
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Fib 2, step 3
. . .
Fib n−2, step 2
Fib n−1, step 2
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Example: Work Stealing
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 19
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Fib 2, step 3
. . .
Fib n−2, step 2
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Work Stealing Implementation Discussion
Doubly-ended queue benefitsTail-side worker push/pop
– Depth-first– Locality
Head side steal– Breadth-first– Potentially favor big steals– Potentially steal less often
Potentially rare deque interferences between worker and thieves
Low overhead, high efficiency for Divide and Conquer algorithms
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 20
Example: Conflict
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 21
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Fib 2, step 3
. . .
Fib n−2, step 2
Fib n−1, step 2
Fib n, step 2
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Example: Conflict
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 21
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Fib 2, step 3
. . .
Fib n−2, step 2
Fib n−1, step 2
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Example: Conflict
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 21
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Fib 2, step 3
. . .
Fib n−2, step 2
T
H
Worker
8
7
6
5
4
3
1
10
2
9
Example: Conflict
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 21
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Fib 2, step 3
T
H
Worker
8
7
6
5
4
3
1
10
2
9
TH(E) Algorithm
GeneralitiesT.H.(E.): Tail, Head, (Exception)Optimization of the deque synchronization mechanism
– Favor the potentially frequent worker pushes/pops. . .– . . . over the potentially infrequent thief steals
PrinciplesThe deque is protected by a expensive lock LThe thieves only modify H
– Thieves always take the lock L before changing H
The worker only modifies TConflicts between a worker and a thief arise when. . .
– . . . the invariant T >= H is broken
The worker speculatively modifies T without taking the lock L!
If the speculation fails, the worker retries with the L lock taken
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 22
TH(E) Algorithm
GeneralitiesT.H.(E.): Tail, Head, (Exception)Optimization of the deque synchronization mechanism
– Favor the potentially frequent worker pushes/pops. . .– . . . over the potentially infrequent thief steals
PrinciplesThe deque is protected by a expensive lock LThe thieves only modify H
– Thieves always take the lock L before changing H
The worker only modifies TConflicts between a worker and a thief arise when. . .
– . . . the invariant T >= H is broken
The worker speculatively modifies T without taking the lock L!
If the speculation fails, the worker retries with the L lock taken
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 22
TH(E) Algorithm
GeneralitiesT.H.(E.): Tail, Head, (Exception)Optimization of the deque synchronization mechanism
– Favor the potentially frequent worker pushes/pops. . .– . . . over the potentially infrequent thief steals
PrinciplesThe deque is protected by a expensive lock LThe thieves only modify H
– Thieves always take the lock L before changing H
The worker only modifies TConflicts between a worker and a thief arise when. . .
– . . . the invariant T >= H is broken
The worker speculatively modifies T without taking the lock L!
If the speculation fails, the worker retries with the L lock taken
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 22
TH(E) Algorithm
GeneralitiesT.H.(E.): Tail, Head, (Exception)Optimization of the deque synchronization mechanism
– Favor the potentially frequent worker pushes/pops. . .– . . . over the potentially infrequent thief steals
PrinciplesThe deque is protected by a expensive lock LThe thieves only modify H
– Thieves always take the lock L before changing H
The worker only modifies TConflicts between a worker and a thief arise when. . .
– . . . the invariant T >= H is broken
The worker speculatively modifies T without taking the lock L!
If the speculation fails, the worker retries with the L lock taken
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 22
TH(E) Algorithm Pseudocode
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 23
1 /∗ Worker ∗ /2 push ( ) {3 T++;4 }5
6 pop ( ) {7 T−−;8 __mem_fence__ ( ) ;9 i f (H > T) {
10 T++;11 lock ( L ) ;12 T−−;13 i f (H > T) {14 T++;15 unlock ( L ) ;16 return FAILURE ;17 }18 unlock ( L ) ;19 return SUCCESS;20 }
1 /∗ Th ie f ∗ /2 s t e a l ( ) {3 lock ( L ) ;4 H++;5 __mem_fence__ ( ) ;6 i f (H > T) {7 H−−;8 unlock ( L ) ;9 return FAILURE ;
10 }11 unlock ( L ) ;12 return SUCCESS;13 }
Cilk’s Keywords Summary
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 24
Rehearsal before looking at someexamples
cilk: declaration of a potentiallyparallel routine
spawn: launch of a potentiallyparallel routine
sync: wait for completion oflaunched routines
inlet : special function to aggregateresults (reduction)
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Cilk’s Keywords Summary
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 24
Rehearsal before looking at someexamples
cilk: declaration of a potentiallyparallel routine
spawn: launch of a potentiallyparallel routine
sync: wait for completion oflaunched routines
inlet : special function to aggregateresults (reduction)
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i f ( n < 2) {5 return n ;6 } else {7 i n t x , y ;8 x = spawn f i b ( n−1) ;9 y = spawn f i b ( n−2) ;
10 sync ;11 return x+y ;12 }13 }
Cilk’s Keywords Summary
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 24
Rehearsal before looking at someexamples
cilk: declaration of a potentiallyparallel routine
spawn: launch of a potentiallyparallel routine
sync: wait for completion oflaunched routines
inlet : special function to aggregateresults (reduction)
1 #include < c i l k . h>2
3 c i l k i n t f i b ( i n t n ) {4 i n t x = 0;5
6 i n l e t void acc ( i n t res ) {7 x += res ;8 }9
10 i f ( n < 2) {11 return n ;12 } else {13 acc (spawn f i b ( n−1) ) ;14 acc (spawn f i b ( n−2) ) ;15 sync ;16 return x ;17 }18 }
Examples
Fib
Knapsack
Heat
LU
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 25
Code Generation Example
Fib
Two-clone StrategyFast clone executed by worker
– “Streamlined” C functionSlow clone executed by a thief after a steal
– Post-steal logic to restore saved state
Olivier Aumage – STORM Team – Cilk – 2. Concept and Ideas 26
Olivier Aumage – STORM Team – Cilk 27
3Cilk Plus
Intel Cilk Plus
URL: http://www.cilkplus.org/
ChangesSupports C and C++
No need to declare Cilk functions
Main keywordscilk_spawn: similar to original Cilk’s spawn
cilk_sync: similar to original Cilk’s synccilk :: reducer <...>
– Template parameterized with a reduction op– Replacement for inlets
cilk_for: parallel loop
Fortran inspired Array Notation
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 28
Intel Cilk Plus
URL: http://www.cilkplus.org/
ChangesSupports C and C++
No need to declare Cilk functions
Main keywordscilk_spawn: similar to original Cilk’s spawn
cilk_sync: similar to original Cilk’s synccilk :: reducer <...>
– Template parameterized with a reduction op– Replacement for inlets
cilk_for: parallel loop
Fortran inspired Array Notation
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 28
Cilk Plus Parallel Loop
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 29
Work-Stealing Loop
cilk_for keyword
Potentially parallel loop
Recursively divided range
Work-stealing load balancing
1 i n t i ;2
3 for ( i =0; i <N; i ++) {4 f ( i ) ;5 }6
7 /∗ − − − − ∗ /8
9 for ( i =0; i <N; i ++) {10 cilk_spawn f ( i ) ;11 }12 cilk_sync ;13
14 /∗ − − − − ∗ /15
16 c i l k_ fo r ( i =0; i <N; i ++) {17 f ( i ) ;18 }
Intel Cilk Plus Arrays
Array notationInspired by FortranHelps compiler auto-vectorization
– May use SIMD instruction sets (SSE, AVX)
Syntax:
1 double a [ lower_bound : leng th : s t r i d e ] ;
Implicit vector loop:
1 for ( i = lower ; i < ( len∗ s t r i d e ) + lower ; i += s t r i d e )2 . . . a [ i ] . . .
Shortcut syntax to access all elements (static arrays only):
1 a [ : ]
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 30
Intel Cilk Plus Arrays
Array notationInspired by FortranHelps compiler auto-vectorization
– May use SIMD instruction sets (SSE, AVX)
Syntax:
1 double a [ lower_bound : leng th : s t r i d e ] ;
Implicit vector loop:
1 for ( i = lower ; i < ( len∗ s t r i d e ) + lower ; i += s t r i d e )2 . . . a [ i ] . . .
Shortcut syntax to access all elements (static arrays only):
1 a [ : ]
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 30
Intel Cilk Plus Arrays
Array notationInspired by FortranHelps compiler auto-vectorization
– May use SIMD instruction sets (SSE, AVX)
Syntax:
1 double a [ lower_bound : leng th : s t r i d e ] ;
Implicit vector loop:
1 for ( i = lower ; i < ( len∗ s t r i d e ) + lower ; i += s t r i d e )2 . . . a [ i ] . . .
Shortcut syntax to access all elements (static arrays only):
1 a [ : ]
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 30
Intel Cilk Plus Arrays
Array notationInspired by FortranHelps compiler auto-vectorization
– May use SIMD instruction sets (SSE, AVX)
Syntax:
1 double a [ lower_bound : leng th : s t r i d e ] ;
Implicit vector loop:
1 for ( i = lower ; i < ( len∗ s t r i d e ) + lower ; i += s t r i d e )2 . . . a [ i ] . . .
Shortcut syntax to access all elements (static arrays only):
1 a [ : ]
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 30
Intel Cilk Plus Arrays
Example expressions1D arrays:
1 a [ : ] = 1 ;2 a [ 2 : 4 ] = 3 ;3 a [ 1 : 1 0 : 2 ] = 7 ; / / se t odd ind i ces
Multi-dimensional arrays:
1 c [ : ] [ : ] = 17;2 c [ 3 ] [ : ] = a [ : ] ; / / a f f e c t an ar ray to some row3 c [ : ] [ 4 ] = a [ : ] ; / / a f f e c t an ar ray to some column4 c [ 0 : 1 0 : 2 ] [ : ] = 1 ; / / se t even rows
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 31
Intel Cilk Plus Arrays
Example expressions1D arrays:
1 a [ : ] = 1 ;2 a [ 2 : 4 ] = 3 ;3 a [ 1 : 1 0 : 2 ] = 7 ; / / se t odd ind i ces
Multi-dimensional arrays:
1 c [ : ] [ : ] = 17;2 c [ 3 ] [ : ] = a [ : ] ; / / a f f e c t an ar ray to some row3 c [ : ] [ 4 ] = a [ : ] ; / / a f f e c t an ar ray to some column4 c [ 0 : 1 0 : 2 ] [ : ] = 1 ; / / se t even rows
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 31
Intel Cilk Plus Arrays
Advanced example expressionsConditions:
1 i f ( a [ : ] == 1) {2 b [ : ] = 42;3 } else {4 b [ : ] = 0 ;5 }
Scalar function call:
1 void f ( double v ) {2 p r i n t f ( "%l f \ n " , v ) ;3 }4
5 f ( a [ : ] ) ;
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 32
Intel Cilk Plus Arrays
Advanced example expressionsConditions:
1 i f ( a [ : ] == 1) {2 b [ : ] = 42;3 } else {4 b [ : ] = 0 ;5 }
Scalar function call:
1 void f ( double v ) {2 p r i n t f ( "%l f \ n " , v ) ;3 }4
5 f ( a [ : ] ) ;
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 32
Intel Cilk Plus Arrays
Advanced example expressions (cont’d)Gather:
1 c [ : ] = a [ b [ : ] ] ;
Scatter:
1 a [ b [ : ] ] = c [ : ] ;
Reductions:
1 __sec_reduce_add ( a [ : ] ) ;2 __sec_reduce_mul ( a [ : ] ) ;3 . . .
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 33
Intel Cilk Plus Arrays
Advanced example expressions (cont’d)Gather:
1 c [ : ] = a [ b [ : ] ] ;
Scatter:
1 a [ b [ : ] ] = c [ : ] ;
Reductions:
1 __sec_reduce_add ( a [ : ] ) ;2 __sec_reduce_mul ( a [ : ] ) ;3 . . .
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 33
Intel Cilk Plus Arrays
Advanced example expressions (cont’d)Gather:
1 c [ : ] = a [ b [ : ] ] ;
Scatter:
1 a [ b [ : ] ] = c [ : ] ;
Reductions:
1 __sec_reduce_add ( a [ : ] ) ;2 __sec_reduce_mul ( a [ : ] ) ;3 . . .
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 33
Other Cilk Plus Ports
Cilk Plus / GCCIntegrated in GCC 4.9.2
– Tasks– Array notation– No cilk_for keyword yet
Usage
1 g++ − f c i l k p l u s − l c i l k r t s −o f i b f i b . cpp
Cilk Plus / LLVMURL: http://cilkplus.github.io/
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 34
Examples using Cilk Plus
Tasks
Arrays
Olivier Aumage – STORM Team – Cilk – 3. Cilk Plus 35
Olivier Aumage – STORM Team – Cilk 36
4Conclusion
Conclusion
Cilk: A Forerunning Language for Parallel Programming
From the programmer point of viewEasy to program with (only 3-4 keywords)
Efficient on Divide and Conquer application classes
From the HPC research point of viewKey precursory effort in promoting task parallelism
Inspirational for many related works over the last 20 years
The Implementation of the Cilk-5 Multithreaded LanguageMatteo Frigo, Charles E. Leiserson, and Keith H. Randall, PLDI’98
Olivier Aumage – STORM Team – Cilk – 4. Conclusion 37
Olivier Aumage – STORM Team – Cilk 38
5Related Works
Variants and Evolutions of the Task Scheduling Model
Some related works Inspired by CilkOpenMP 3.0 tasks
Intel Threading Building Blocks (TBB)
Evolutions of the task modelDependent Tasks
Heterogeneous Tasks
Olivier Aumage – STORM Team – Cilk – 5. Related Works 39
Variants and Evolutions of the Task Scheduling Model
Some related works Inspired by CilkOpenMP 3.0 tasks
Intel Threading Building Blocks (TBB)
Evolutions of the task modelDependent Tasks
Heterogeneous Tasks
Olivier Aumage – STORM Team – Cilk – 5. Related Works 39
Dependent Tasks
More flexibility in specifying task relationshipsVarious possibilities
– Task – Task dependencies– Task – Data dependencies
Various expressiveness vs automatism trade-offs– Expressed dependencies– Inferred dependencies
Some examplesStarPU (STORM Team)
OpenMP 4.0 tasks
OmpSs (Barcelona Supercomputing Center)
Kaapi (Inria Team MOAIS)
DAGuE / PARSeC (UTK + Inria Team HiePACS)
etc.
Olivier Aumage – STORM Team – Cilk – 5. Related Works 40
Dependent Tasks
More flexibility in specifying task relationshipsVarious possibilities
– Task – Task dependencies– Task – Data dependencies
Various expressiveness vs automatism trade-offs– Expressed dependencies– Inferred dependencies
Some examplesStarPU (STORM Team)
OpenMP 4.0 tasks
OmpSs (Barcelona Supercomputing Center)
Kaapi (Inria Team MOAIS)
DAGuE / PARSeC (UTK + Inria Team HiePACS)
etc.
Olivier Aumage – STORM Team – Cilk – 5. Related Works 40
Ex.: Sequential Cholesky Decomposition
Olivier Aumage – STORM Team – Cilk – 5. Related Works 41
for (j = 0; j < N; j++) {POTRF ( A[j][j]);for (i = j+1; i < N; i++)
TRSM ( A[i][j], A[j][j]);for (i = j+1; i < N; i++) {
SYRK ( A[i][i], A[i][j]);for (k = j+1; k < i; k++)
GEMM ( A[i][k],A[i][j], A[k][j]);
}}
Ex.: Task-Based Cholesky Decomposition
Olivier Aumage – STORM Team – Cilk – 5. Related Works 42
for (j = 0; j < N; j++) {POTRF (RW,A[j][j]);for (i = j+1; i < N; i++)
TRSM (RW,A[i][j], R,A[j][j]);for (i = j+1; i < N; i++) {
SYRK (RW,A[i][i], R,A[i][j]);for (k = j+1; k < i; k++)
GEMM (RW,A[i][k],R,A[i][j], R,A[k][j]);
}}__wait__();
Ex.: Inferred Dependencies
Olivier Aumage – STORM Team – Cilk – 5. Related Works 43
for (j = 0; j < N; j++) {POTRF (RW,A[j][j]);for (i = j+1; i < N; i++)
TRSM (RW,A[i][j], R,A[j][j]);for (i = j+1; i < N; i++) {
SYRK (RW,A[i][i], R,A[i][j]);for (k = j+1; k < i; k++)
GEMM (RW,A[i][k],R,A[i][j], R,A[k][j]);
}}__wait__();
POTRF
GEMM
TRSM
SYRK
Ex.: Inferred Dependencies
Olivier Aumage – STORM Team – Cilk – 5. Related Works 43
for (j = 0; j < N; j++) {POTRF (RW,A[j][j]);for (i = j+1; i < N; i++)
TRSM (RW,A[i][j], R,A[j][j]);for (i = j+1; i < N; i++) {
SYRK (RW,A[i][i], R,A[i][j]);for (k = j+1; k < i; k++)
GEMM (RW,A[i][k],R,A[i][j], R,A[k][j]);
}}__wait__();
POTRF
GEMM
TRSM
SYRK
Ex.: Inferred Dependencies
Olivier Aumage – STORM Team – Cilk – 5. Related Works 43
for (j = 0; j < N; j++) {POTRF (RW,A[j][j]);for (i = j+1; i < N; i++)
TRSM (RW,A[i][j], R,A[j][j]);for (i = j+1; i < N; i++) {
SYRK (RW,A[i][i], R,A[i][j]);for (k = j+1; k < i; k++)
GEMM (RW,A[i][k],R,A[i][j], R,A[k][j]);
}}__wait__();
POTRF
GEMM
TRSM
SYRK
Ex.: Inferred Dependencies
Olivier Aumage – STORM Team – Cilk – 5. Related Works 43
for (j = 0; j < N; j++) {POTRF (RW,A[j][j]);for (i = j+1; i < N; i++)
TRSM (RW,A[i][j], R,A[j][j]);for (i = j+1; i < N; i++) {
SYRK (RW,A[i][i], R,A[i][j]);for (k = j+1; k < i; k++)
GEMM (RW,A[i][k],R,A[i][j], R,A[k][j]);
}}__wait__();
POTRF
GEMM
TRSM
SYRK
Ex.: Inferred Dependencies
Olivier Aumage – STORM Team – Cilk – 5. Related Works 43
for (j = 0; j < N; j++) {POTRF (RW,A[j][j]);for (i = j+1; i < N; i++)
TRSM (RW,A[i][j], R,A[j][j]);for (i = j+1; i < N; i++) {
SYRK (RW,A[i][i], R,A[i][j]);for (k = j+1; k < i; k++)
GEMM (RW,A[i][k],R,A[i][j], R,A[k][j]);
}}__wait__();
POTRF
GEMM
TRSM
SYRK
Ex.: Inferred Dependencies
Olivier Aumage – STORM Team – Cilk – 5. Related Works 43
for (j = 0; j < N; j++) {POTRF (RW,A[j][j]);for (i = j+1; i < N; i++)
TRSM (RW,A[i][j], R,A[j][j]);for (i = j+1; i < N; i++) {
SYRK (RW,A[i][i], R,A[i][j]);for (k = j+1; k < i; k++)
GEMM (RW,A[i][k],R,A[i][j], R,A[k][j]);
}}__wait__();
POTRF
GEMM
TRSM
SYRK
Ex.: Inferred Dependencies
Olivier Aumage – STORM Team – Cilk – 5. Related Works 43
for (j = 0; j < N; j++) {POTRF (RW,A[j][j]);for (i = j+1; i < N; i++)
TRSM (RW,A[i][j], R,A[j][j]);for (i = j+1; i < N; i++) {
SYRK (RW,A[i][i], R,A[i][j]);for (k = j+1; k < i; k++)
GEMM (RW,A[i][k],R,A[i][j], R,A[k][j]);
}}__wait__();
POTRF
GEMM
TRSM
SYRK
Ex.: Inferred Dependencies
Olivier Aumage – STORM Team – Cilk – 5. Related Works 43
for (j = 0; j < N; j++) {POTRF (RW,A[j][j]);for (i = j+1; i < N; i++)
TRSM (RW,A[i][j], R,A[j][j]);for (i = j+1; i < N; i++) {
SYRK (RW,A[i][i], R,A[i][j]);for (k = j+1; k < i; k++)
GEMM (RW,A[i][k],R,A[i][j], R,A[k][j]);
}}__wait__();
POTRF
GEMM
TRSM
SYRK
Ex.: Inferred Dependencies
Olivier Aumage – STORM Team – Cilk – 5. Related Works 43
for (j = 0; j < N; j++) {POTRF (RW,A[j][j]);for (i = j+1; i < N; i++)
TRSM (RW,A[i][j], R,A[j][j]);for (i = j+1; i < N; i++) {
SYRK (RW,A[i][i], R,A[i][j]);for (k = j+1; k < i; k++)
GEMM (RW,A[i][k],R,A[i][j], R,A[k][j]);
}}__wait__();
POTRF
GEMM
TRSM
SYRK
Ex.: Inferred Dependencies
Olivier Aumage – STORM Team – Cilk – 5. Related Works 43
for (j = 0; j < N; j++) {POTRF (RW,A[j][j]);for (i = j+1; i < N; i++)
TRSM (RW,A[i][j], R,A[j][j]);for (i = j+1; i < N; i++) {
SYRK (RW,A[i][i], R,A[i][j]);for (k = j+1; k < i; k++)
GEMM (RW,A[i][k],R,A[i][j], R,A[k][j]);
}}__wait__();
POTRF
GEMM
TRSM
SYRK
Ex.: Inferred Dependencies
Olivier Aumage – STORM Team – Cilk – 5. Related Works 43
for (j = 0; j < N; j++) {POTRF (RW,A[j][j]);for (i = j+1; i < N; i++)
TRSM (RW,A[i][j], R,A[j][j]);for (i = j+1; i < N; i++) {
SYRK (RW,A[i][i], R,A[i][j]);for (k = j+1; k < i; k++)
GEMM (RW,A[i][k],R,A[i][j], R,A[k][j]);
}}__wait__();
POTRF
GEMM
TRSM
SYRK
Ex.: Inferred Dependencies
Olivier Aumage – STORM Team – Cilk – 5. Related Works 43
for (j = 0; j < N; j++) {POTRF (RW,A[j][j]);for (i = j+1; i < N; i++)
TRSM (RW,A[i][j], R,A[j][j]);for (i = j+1; i < N; i++) {
SYRK (RW,A[i][i], R,A[i][j]);for (k = j+1; k < i; k++)
GEMM (RW,A[i][k],R,A[i][j], R,A[k][j]);
}}__wait__();
POTRF
GEMM
TRSM
SYRK
Ex.: Inferred Dependencies
Olivier Aumage – STORM Team – Cilk – 5. Related Works 43
for (j = 0; j < N; j++) {POTRF (RW,A[j][j]);for (i = j+1; i < N; i++)
TRSM (RW,A[i][j], R,A[j][j]);for (i = j+1; i < N; i++) {
SYRK (RW,A[i][i], R,A[i][j]);for (k = j+1; k < i; k++)
GEMM (RW,A[i][k],R,A[i][j], R,A[k][j]);
}}__wait__();
POTRF
GEMM
TRSM
SYRK
Heterogeneous Tasks
Olivier Aumage – STORM Team – Cilk – 5. Related Works 44
Special purpose accelerator devices(or general purpose GPUs)
(initially) a discrete expansion card
Rationale: dye area trade-off
A single control unit. . .
. . . for several computing units
Allows flows to diverge
. . . but better avoid it!
Heterogeneous Tasks
Olivier Aumage – STORM Team – Cilk – 5. Related Works 44
Special purpose accelerator devices(or general purpose GPUs)
(initially) a discrete expansion card
Rationale: dye area trade-off
Single Instruction Multiple Threads (SIMT)
A single control unit. . .
. . . for several computing units
Allows flows to diverge
. . . but better avoid it!
Heterogeneous Tasks
Olivier Aumage – STORM Team – Cilk – 5. Related Works 44
Special purpose accelerator devices(or general purpose GPUs)
(initially) a discrete expansion card
Rationale: dye area trade-off
Single Instruction Multiple Threads (SIMT)
A single control unit. . .
. . . for several computing units
Allows flows to diverge
. . . but better avoid it!
GPU
DRAM
Control
Control Scalar Cores
(Streaming Processors)
Streaming Multiprocessor
R1 + R2
R5 / R2
Scalar Cores
Heterogeneous Tasks
Olivier Aumage – STORM Team – Cilk – 5. Related Works 44
Special purpose accelerator devices(or general purpose GPUs)
(initially) a discrete expansion card
Rationale: dye area trade-off
Single Instruction Multiple Threads (SIMT)
A single control unit. . .
. . . for several computing units
Allows flows to diverge
. . . but better avoid it!
GPU
Control Scalar Cores
(Streaming Processors)
Streaming Multiprocessor
R1 + R2
...if(cond){
... ... ...
} else { ... ...}...
GPU Hardware Model
CPU vs GPU
Multiple strategies for multiple purposesCPU
– Strategy– Large caches– Large control
– Purpose– Complex codes, branching– Complex memory access patterns
– World Rally Championship carGPU
– Strategy– Lot of computing power– Simplified control
– Purpose– Regular data parallel codes– Simple memory access patterns
– Formula One car
Olivier Aumage – STORM Team – Cilk – 5. Related Works 45
CPU
GPU
Control ALU ALU
ALU ALU
Cache
DRAM
DRAM
Heterogeneous Tasks
Main issuesOffload or not offload?
– Computation kernel efficiency on accelerator wrt on the main CPU?Discrete accelerator memory
– Data transfer management– Data transfer cost
Some examplesStarPU (STORM Team)
OmpSs (Barcelona Supercomputing Center)
DAGuE / PARSeC (UTK + Inria Team HiePACS)
Olivier Aumage – STORM Team – Cilk – 5. Related Works 46
StarPU Showcase with the MAGMA Linear Algebra Library
UTK, INRIA HIEPACS, INRIA RUNTIME
QR decomposition on 16 CPUs (AMD) + 4 GPUs (C1060)
Olivier Aumage – STORM Team – Cilk – 5. Related Works 47
0
200
400
600
800
1000
0 5000 10000 15000 20000 25000 30000 35000 40000
Gflo
p/s
Matrix order
4 GPUs + 16 CPUs4 GPUs + 4 CPUs3 GPUs + 3 CPUs2 GPUs + 2 CPUs1 GPUs + 1 CPUs
Expected increase: +12 CPUs ~150 Gflops
Measured increase: +12 CPUs ~200 GFlops
Showcase with the MAGMA Linear Algebra Library
QR kernel propertiesKernel SGEQRTCPU: 9 GFlop/s GPU: 30 GFlop/s Speed-up: 3Kernel STSQRTCPU: 12 GFlop/s GPU: 37 GFlop/s Speed-up: 3Kernel SOMQRTCPU: 8.5 GFlop/s GPU: 227 GFlop/s Speed-up: 27Kernel SSSMQCPU: 10 GFlop/s GPU: 285 GFlop/s Speed-up: 28
ConsequencesTask distribution
– SGEQRT: 20% Tasks on GPU– SSSMQ: 92% tasks on GPU
StarPU takes advantage of heterogeneity!– Only use the accelerator for kernels it is good for– Don’t slow down the accelerator with kernels it is not good for
Olivier Aumage – STORM Team – Cilk – 5. Related Works 48
Upcoming Tutorial
PRACE PATC Training session on Runtime SystemsAt INRIA in Bordeaux, France
June 4-5, 2015
In partnership with La Maison de la Simulation
http://www.maisondelasimulation.fr/
ProgramHwloc: locality management tool/lib
StarPU: task scheduler
EZtrace: trace-based debugging framework
Kaapi: task scheduler
Olivier Aumage – STORM Team – Cilk – 5. Related Works 49
top related