TRANSCRIPT
Prof. Saman Amarasinghe, MIT. 6.189 IAP 2007.
6.189 IAP 2007
Lecture 12
StreamIt Parallelizing Compiler
Common Machine Language
● Represent common properties of architectures
  Necessary for performance
● Abstract away differences in architectures
  Necessary for portability
● Cannot be too complex
  Must keep in mind the typical programmer
● C and Fortran were the common machine languages for uniprocessors
● Imperative languages are not the correct abstraction for parallel architectures
● What is the correct abstraction for parallel multicore machines?
Common Machine Language for Multicores
● Current offerings: OpenMP, MPI, High Performance Fortran
● Explicit parallel constructs grafted onto imperative language
● Language features obscured: composability, malleability, debugging
● Huge additional burden on programmer:
  Introducing parallelism
  Correctness of parallelism
  Optimizing parallelism
Explicit Parallelism
● Programmer controls details of parallelism!
● Granularity decisions:
  If too small, lots of synchronization and thread creation
  If too large, bad locality
● Load balancing decisions
  Create balanced parallel sections (not data-parallel)
● Locality decisions
  Sharing and communication structure
● Synchronization decisions
  Barriers, atomicity, critical sections, order, flushing
● For mass adoption, we need a better paradigm:
  Where the parallelism is natural
  Exposes the necessary information to the compiler
Unburden the Programmer
● Move these decisions to the compiler!
  Granularity
  Load balancing
  Locality
  Synchronization
● Hard to do in traditional languages
  Can a novel language help?
Properties of Stream Programs
● Regular and repeating computation
● Synchronous Data Flow
● Independent actors with explicit communication
● Data items have short lifetimes
● Benefits:
  Naturally parallel
  Expose dependencies to compiler
  Enable powerful transformations
(Figure: FMRadio stream graph: AtoD → FMDemod → Duplicate splitter → parallel LPF1/LPF2/LPF3 and HPF1/HPF2/HPF3 channels → RoundRobin joiner → Adder → Speaker)
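The properties above can be made concrete with a small sketch. Below is a minimal synchronous-dataflow model in Python (not StreamIt itself, and the `fire`/buffer encoding is an illustrative assumption): each filter has fixed pop/push rates, owns no shared state, and communicates only through an explicit FIFO, which is exactly what exposes the dependence structure to a compiler.

```python
from collections import deque

def make_pipeline(n_filters):
    # One FIFO upstream of each filter, plus one for the final output.
    return [deque() for _ in range(n_filters + 1)]

def fire(bufs, i, pop, push, work):
    # Fire filter i once: pop a fixed number of items from its input
    # FIFO, compute, and push a fixed number of items downstream.
    # All communication is explicit and local.
    args = [bufs[i].popleft() for _ in range(pop)]
    outs = work(args)
    assert len(outs) == push, "push rate must be constant"
    for y in outs:
        bufs[i + 1].append(y)

# Example: a 3-in/1-out adder stage fired twice.
bufs = make_pipeline(1)
bufs[0].extend([1, 2, 3, 4, 5, 6])
fire(bufs, 0, pop=3, push=1, work=lambda xs: [sum(xs)])
fire(bufs, 0, pop=3, push=1, work=lambda xs: [sum(xs)])
print(list(bufs[1]))  # [6, 15]
```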
Outline
● Why do we need new languages?
● Static Schedule
● Three Types of Parallelism
● Exploiting Parallelism
Steady-State Schedule
● All data pop/push rates are constant
● Can find a Steady-State Schedule
  # of items in the buffers are the same before and after executing the schedule
  There exists a unique minimum steady-state schedule
● Example pipeline A → B → C (A: push=2; B: pop=3, push=1; C: pop=2), built up one firing at a time:
  Schedule = { }
  Schedule = { A }
  Schedule = { A, A }
  Schedule = { A, A, B }
  Schedule = { A, A, B, A }
  Schedule = { A, A, B, A, B }
  Schedule = { A, A, B, A, B, C }
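For a linear pipeline, the minimum steady-state schedule can be computed from the balance equations: reps[i] * push_i = reps[i+1] * pop_{i+1} for each channel. A sketch in Python (the list-of-(pop, push) encoding is our illustrative assumption, not StreamIt's internal representation):

```python
from fractions import Fraction
from math import gcd

def steady_state(rates):
    # rates: one (pop, push) pair per filter in a linear pipeline.
    # Balance equation per channel: reps[i]*push_i == reps[i+1]*pop_{i+1}.
    reps = [Fraction(1)]
    for (_, push), (pop, _) in zip(rates, rates[1:]):
        reps.append(reps[-1] * Fraction(push, pop))
    # Scale up to the smallest all-integer solution.
    m = 1
    for r in reps:
        m = m * r.denominator // gcd(m, r.denominator)
    ints = [int(r * m) for r in reps]
    g = 0
    for v in ints:
        g = gcd(g, v)
    return [v // g for v in ints]

# A push=2, B pop=3 push=1, C pop=2 (the pipeline above):
print(steady_state([(0, 2), (3, 1), (2, 0)]))  # [3, 2, 1]
```

The result [3, 2, 1] says A fires 3 times, B twice, and C once per steady state, matching the schedule { A, A, B, A, B, C }.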
Initialization Schedule
● When peek > pop, the buffer cannot be empty after firing a filter
● Buffers are not empty at the beginning/end of the steady state schedule
● Need to fill the buffers before starting the steady state execution
  (Example filter: peek=4, pop=3, push=1)
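A peeking filter inspects peek items but consumes only pop of them, so peek − pop items persist in its input buffer across firings; the initialization schedule must place at least that many extra items in the buffer before steady state begins. A small simulation (the sliding-window-sum work function is a made-up example):

```python
def leftover(peek, pop):
    # Items that remain buffered after every firing of a peeking filter.
    return max(peek - pop, 0)

def run_peeking_filter(peek, pop, data):
    # Fire a peek=4, pop=3 style filter until it can no longer peek.
    buf, out = list(data), []
    while len(buf) >= peek:
        out.append(sum(buf[:peek]))  # read `peek` items
        del buf[:pop]                # but consume only `pop`
    return out, buf

print(leftover(4, 3))                               # 1
print(run_peeking_filter(4, 3, [1, 2, 3, 4, 5, 6, 7]))
# ([10, 22], [7]): one item is always left behind in the buffer
```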
Outline
● Why do we need new languages?
● Static Schedule
● Three Types of Parallelism
● Exploiting Parallelism
Types of Parallelism
● Task Parallelism
  Parallelism explicit in algorithm
  Between filters without producer/consumer relationship
● Data Parallelism
  Between iterations of a stateless filter
  Peel iterations of filter, place within scatter/gather pair (fission)
  Can't parallelize filters with state
● Pipeline Parallelism
  Between producers and consumers
  Stateful filters can be parallelized
(Figure: stream graph annotated with task, pipeline, and data parallelism; data-parallel filters sit between scatter/gather pairs)
Types of Parallelism
Traditionally:
● Task Parallelism: thread (fork/join) parallelism
● Data Parallelism: data-parallel loop (forall)
● Pipeline Parallelism: usually exploited in hardware
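The traditional counterparts can be sketched with Python's standard thread pool (an illustrative stand-in; the lecture does not prescribe any particular API):

```python
from concurrent.futures import ThreadPoolExecutor

def fork_join(tasks):
    # Task parallelism: fork independent tasks, then join their results.
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        return [f.result() for f in [pool.submit(t) for t in tasks]]

def forall(fn, data, workers=4):
    # Data parallelism: a data-parallel loop over independent iterations.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, data))

print(fork_join([lambda: 1 + 1, lambda: 2 * 3]))  # [2, 6]
print(forall(lambda x: x * x, range(5)))          # [0, 1, 4, 9, 16]
```

Pipeline parallelism has no equally direct analogue here, which is the lecture's point: it is usually exploited in hardware rather than expressed in the language.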
Outline
● Why do we need new languages?
● Static Schedule
● Three Types of Parallelism
● Exploiting Parallelism
Baseline 1: Task Parallelism
(Figure: Splitter feeding two identical pipelines, BandPass → Compress → Process → Expand → BandStop, whose outputs are combined by a Joiner and an Adder)
● Inherent task parallelism between the two processing pipelines
● Task Parallel Model:
  Only parallelize explicit task parallelism
  Fork/join parallelism
● Executing this on a 2 core machine gives ~2x speedup over a single core
● What about 4, 16, 1024, … cores?
Evaluation: Task Parallelism
(Chart: throughput normalized to single-core StreamIt for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean; y-axis 0 to 19)
Raw Microprocessor:
  16 in-order, single-issue cores with D$ and I$
  16 memory banks, each bank with DMA
  Cycle-accurate simulator
Parallelism: Not matched to target!
Synchronization: Not matched to target!
Baseline 2: Fine-Grained Data Parallelism
(Figure: every filter in the two pipelines, including the Adder, is replaced by a splitter, four data-parallel copies, and a joiner)
● Each of the filters in the example is stateless
● Fine-Grained Data Parallel Model:
  Fiss each stateless filter N ways (N is the number of cores)
  Remove scatter/gather if possible
● We can introduce data parallelism
  Example: 4 cores
● Each fission group occupies the entire machine
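Fission of a stateless filter can be sketched as a round-robin scatter, N independent copies, and a round-robin gather; because the filter keeps no state, the lanes can run on different cores in any order. A sequential Python model of the transformation (names are illustrative):

```python
def fiss(filter_fn, n, stream):
    # Scatter: deal items round-robin to n copies of the filter.
    lanes = [stream[i::n] for i in range(n)]
    # Each copy processes its lane independently (stateless filter).
    results = [[filter_fn(x) for x in lane] for lane in lanes]
    # Gather: interleave round-robin to restore the original order.
    out = []
    for i in range(max((len(lane) for lane in results), default=0)):
        for lane in results:
            if i < len(lane):
                out.append(lane[i])
    return out

# Fissing 4 ways is observationally identical to the original filter:
print(fiss(lambda x: x * x, 4, list(range(10))) ==
      [x * x for x in range((10))])  # True
```

The cost the evaluation highlights is not in the copies but in the scatter/gather pairs: done per filter at fine granularity, they add synchronization on every item.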
Evaluation: Fine-Grained Data Parallelism
(Chart: throughput normalized to single-core StreamIt, comparing Task vs. Fine-Grained Data across the benchmark suite; y-axis 0 to 19)
Good Parallelism! Too Much Synchronization!
Baseline 3: Hardware Pipeline Parallelism
(Figure: greedy fusion collapses each pipeline to BandPass, a fused Compress/Process/Expand filter, and BandStop; with the Splitter, Joiner, and Adder this yields one filter per core on 8 cores)
● The BandPass and BandStop filters contain all the work
● Hardware Pipelining:
  Use a greedy algorithm to fuse adjacent filters
  Want # filters <= # cores
● Example: 8 cores
  Resultant stream graph is mapped to hardware
  One filter per core
● What about 4, 16, 1024, … cores?
  Performance dependent on fusing to a load-balanced stream graph
Evaluation: Hardware Pipeline Parallelism
(Chart: throughput normalized to single-core StreamIt, comparing Task, Fine-Grained Data, and Hardware Pipelining across the benchmark suite; y-axis 0 to 19)
Parallelism: Not matched to target!
Synchronization: Not matched to target!
The StreamIt Compiler
1. Coarsen: fuse stateless sections of the graph
2. Data Parallelize: parallelize stateless filters
3. Software Pipeline: parallelize stateful filters
Compile to a 16 core architecture: 11.2x mean throughput speedup over single core
Phase 1: Coarsen the Stream Graph
(Figure: each BandPass → Compress → Process → Expand sequence is fused into a single filter; the peeking BandStop filters are left unfused)
● Before data-parallelism is exploited
● Fuse stateless pipelines as much as possible without introducing state
  Don't fuse stateless with stateful
  Don't fuse a peeking filter with anything upstream
● Benefits:
  Reduces global communication and synchronization
  Exposes inter-node optimization opportunities
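Fusing a stateless pipeline amounts to composing the filters' work functions, so intermediate items flow filter to filter without touching a global buffer. A sketch (per-item one-in/one-out filters are a simplifying assumption; real fusion also reconciles pop/push rates):

```python
def fuse(*filters):
    # Compose stateless per-item filters into one coarser filter.
    # Intermediate values stay local instead of crossing FIFOs,
    # reducing communication and exposing cross-filter optimization.
    def fused(x):
        for f in filters:
            x = f(x)
        return x
    return fused

# Hypothetical stand-ins for BandPass, Compress, etc.:
band_pass = lambda x: x + 1
compress = lambda x: x * 2
coarse = fuse(band_pass, compress)
print(coarse(3))  # 8, same as compress(band_pass(3))
```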
Phase 2: Data Parallelize
Data Parallelize for 4 cores:
● The Adder is fissed 4 ways to occupy the entire chip
● Task parallelism!
  Each fused filter does equal work
  Fiss each fused filter 2 times to occupy the entire chip
● The peeking BandStop filters are likewise fissed 2 times
● Task-conscious data parallelization
  Preserve task parallelism
● Benefits:
  Reduces global communication and synchronization
(Figure: the resulting graph, with each fissed filter inside its own splitter/joiner pair)
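Task-conscious data parallelization picks the fission factor from the parallelism already present: with W equal-work parallel tasks on an N-core chip, each filter only needs to be fissed N / W ways. A one-line sketch of that arithmetic (the function name is ours):

```python
def fission_factor(n_cores, task_width):
    # Fiss just enough to fill the chip while preserving the existing
    # task parallelism (e.g., 4 cores / 2 parallel pipelines -> 2-way).
    return max(n_cores // task_width, 1)

print(fission_factor(4, 2))  # 2: each fused filter is fissed twice
print(fission_factor(4, 1))  # 4: the lone Adder is fissed four ways
```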
Evaluation: Coarse-Grained Data Parallelism
(Chart: throughput normalized to single-core StreamIt, comparing Task, Fine-Grained Data, Hardware Pipelining, and Coarse-Grained Task + Data across the benchmark suite; y-axis 0 to 19)
Good Parallelism! Low Synchronization!
Simplified Vocoder
(Figure: vocoder stream graph with per-filter work estimates: two AdaptDFT filters (66 each) under a splitter/joiner; RectPolar (20, data parallel); two parallel chains of Amplify (2) → Diff (1) → UnWrap (1) → Accum (1), data parallel but too little work!; PolarRect (20, data parallel))
Target a 4 core machine
Data Parallelize
(Figure: RectPolar and PolarRect (20 units of work each) are each fissed 4 ways inside splitter/joiner pairs, leaving 5 units of work per copy; the rest of the graph is unchanged)
Target a 4 core machine
Data + Task Parallel Execution
(Figure: steady-state execution schedule on 4 cores over time; the data-parallel splitter/joiner sections execute as separate phases, and one steady state takes 21 time units)
Target 4 core machine
We Can Do Better!
(Figure: the same work rescheduled more tightly; one steady state now fits in 16 time units on the 4 cores)
Target 4 core machine
Phase 3: Coarse-Grained Software Pipelining
(Figure: successive RectPolar firings from different steady-state iterations overlap; a prologue precedes the new steady state)
● New steady-state is free of dependencies
● Schedule new steady-state using a greedy partitioning
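Coarse-grained software pipelining can be modeled as having each filter, within one steady state, consume data produced by the previous steady state: after a prologue fills the pipeline, the firings inside a steady state have no mutual dependencies and can be partitioned freely across cores. A sequential Python model of that schedule (the list-of-stages encoding is an assumption):

```python
def software_pipeline(stages, inputs):
    n = len(stages)
    regs = [None] * n  # values produced by the previous iteration
    outputs = []
    # Pad with Nones to drain the pipeline after the last input.
    for x in list(inputs) + [None] * n:
        new_regs = [None] * n
        # Within this "steady state", every stage reads only the last
        # iteration's results, so the firings are mutually independent.
        new_regs[0] = stages[0](x) if x is not None else None
        for s in range(1, n):
            if regs[s - 1] is not None:
                new_regs[s] = stages[s](regs[s - 1])
        if new_regs[n - 1] is not None:
            outputs.append(new_regs[n - 1])
        regs = new_regs
    return outputs

print(software_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))
# [4, 6, 8]: same results as running the pipeline unpipelined
```

Note that even a stateful stage works here, since each stage still fires in its original order; only the inter-stage dependencies are relaxed across iterations.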
Greedy Partitioning
Target 4 core machine
(Figure: the independent filter firings to schedule are greedily packed onto the 4 cores, filling a 16-time-unit schedule)
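The greedy partitioning can be sketched as longest-processing-time bin packing: sort the firings by estimated work and repeatedly assign the heaviest to the least-loaded core. The work estimates below echo the vocoder example but are illustrative:

```python
import heapq

def greedy_partition(work, n_cores):
    # work: filter firing -> estimated cycles. Assign heaviest-first
    # onto the currently least-loaded core (classic LPT heuristic).
    heap = [(0, c) for c in range(n_cores)]
    assignment = {}
    for f, w in sorted(work.items(), key=lambda kv: -kv[1]):
        load, c = heapq.heappop(heap)
        assignment[f] = c
        heapq.heappush(heap, (load + w, c))
    return assignment

work = {'AdaptDFT_a': 16, 'RectPolar_1': 5, 'RectPolar_2': 5,
        'Amplify': 2, 'Diff': 1, 'UnWrap': 1, 'Accum': 1}
asg = greedy_partition(work, 4)
loads = [0] * 4
for f, c in asg.items():
    loads[c] += work[f]
print(max(loads))  # 16: the heaviest firing bounds the schedule length
```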
Evaluation: Coarse-Grained Task + Data + Software Pipelining
(Chart: throughput normalized to single-core StreamIt, comparing Task, Fine-Grained Data, Hardware Pipelining, Coarse-Grained Task + Data, and Coarse-Grained Task + Data + Software Pipeline across the benchmark suite; y-axis 0 to 19)
Best Parallelism! Lowest Synchronization!
Summary
● Robust speedups across varied benchmark suite
                                                 Parallelism            Synchronization
Task                                             Application Dependent  Application Dependent
Fine-Grained Data                                Good                   High
Hardware Pipelining                              Application Dependent  Application Dependent
Coarse-Grained Task + Data                       Good                   Low
Coarse-Grained Task + Data + Software Pipeline   Best                   Lowest
● Streaming model naturally exposes task, data, and pipeline parallelism
● This parallelism must be exploited at the correct granularity and combined correctly