1
Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs
Michael Gordon, William Thies, and Saman Amarasinghe
Massachusetts Institute of Technology
ASPLOS, October 2006, San Jose, CA
http://cag.csail.mit.edu/streamit
2
Multicores Are Here!
[Figure: number of cores per chip (log scale, 1 to 512) plotted against year (1970 to 20??). Chips shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045.]
3
Multicores Are Here!
For uniprocessors, C was:
• Portable
• High Performance
• Composable
• Malleable
• Maintainable
Uniprocessors: C is the common machine language
[Figure: chart of number of cores per chip (1 to 512) versus year (1970 to 20??), showing chips from the 4004 through the Ambric AM2045.]
4
Multicores Are Here!
What is the common machine language for multicores?
[Figure: chart of number of cores per chip (1 to 512) versus year (1970 to 20??), showing chips from the 4004 through the Ambric AM2045.]
5
Common Machine Languages
Uniprocessors:
• Common properties
  – Single flow of control
  – Single memory image
• Differences
  – Register file
  – ISA
  – Functional units
• Handled by: register allocation, instruction selection, instruction scheduling
Multicores:
• Common properties
  – Multiple flows of control
  – Multiple local memories
• Differences
  – Number and capabilities of cores
  – Communication model
  – Synchronization model
von Neumann languages represent the common properties and abstract away the differences
Need common machine language(s) for multicores
6
Streaming as a Common Machine Language
• Regular and repeating computation
• Independent filters with explicit communication
  – Segregated address spaces and multiple program counters
• Natural expression of parallelism:
  – Producer / consumer dependencies
  – Enables powerful, whole-program transformations
[Figure: example stream graph with filters AtoD, FMDemod, Scatter, LPF1 to LPF3, HPF1 to HPF3, Gather, Adder, and Speaker.]
8
Types of Parallelism
Task Parallelism
– Parallelism explicit in algorithm
– Between filters without producer/consumer relationship
Data Parallelism
– Between iterations of a stateless filter
– Place within scatter/gather pair (fission)
– Can’t parallelize filters with state
Pipeline Parallelism
– Between producers and consumers
– Stateful filters can be parallelized
[Figure: stream graph with scatter/gather pairs, annotated to show where task, pipeline, and data parallelism occur.]
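The fission idea above can be illustrated with a small Python sketch. It is not the StreamIt compiler's implementation: the helper names `run_filter`, `fiss`, and `scale` are hypothetical, and the copies run one after another rather than on separate cores. It only shows that a stateless pop-1/push-1 filter placed inside a round-robin scatter/gather pair produces the same output as the original filter.

```python
from typing import Callable, List


def run_filter(work: Callable[[float], float], items: List[float]) -> List[float]:
    """Run a stateless pop-1/push-1 filter over the whole input, in order."""
    return [work(x) for x in items]


def fiss(work: Callable[[float], float], items: List[float], ways: int) -> List[float]:
    """Data-parallelize a stateless filter with a round-robin scatter/gather pair."""
    # Scatter: copy k receives items k, k + ways, k + 2*ways, ...
    lanes = [items[k::ways] for k in range(ways)]
    # Each copy could run on its own core; here they simply run in turn.
    results = [run_filter(work, lane) for lane in lanes]
    # Gather: interleave the lane outputs back into the original order.
    out = [0.0] * len(items)
    for k, lane_out in enumerate(results):
        out[k::ways] = lane_out
    return out


def scale(x: float) -> float:
    """Hypothetical stateless work function."""
    return 2.0 * x


data = [float(i) for i in range(10)]
fissed = fiss(scale, data, ways=4)
sequential = run_filter(scale, data)
```

The equivalence holds only because `scale` keeps no state between firings; a filter that updates a field on every firing would see a different history in each copy.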
9
Types of Parallelism
Traditionally:
Task Parallelism
– Thread (fork/join) parallelism
Data Parallelism
– Data parallel loop (forall)
Pipeline Parallelism
– Usually exploited in hardware
[Figure: stream graph with scatter/gather pairs, annotated with task, pipeline, and data parallelism.]
10
Problem Statement
Given:
– Stream graph with compute and communication estimate for each filter
– Computation and communication resources of the target machine
Find:
– Schedule of execution for the filters that best utilizes the available parallelism to fit the machine resources
11
Our 3-Phase Solution
1. Coarsen: Fuse stateless sections of the graph
2. Data Parallelize: parallelize stateless filters
3. Software Pipeline: parallelize stateful filters
Compile to a 16 core architecture
– 11.2x mean throughput speedup over single core
[Figure: the three phases in order: Coarsen Granularity, Data Parallelize, Software Pipeline.]
12
Outline
• StreamIt language overview
• Mapping to multicores
  – Baseline techniques
  – Our 3-phase solution
13
The StreamIt Project
• Applications
  – DES and Serpent [PLDI 05]
  – MPEG-2 [IPDPS 06]
  – SAR, DSP benchmarks, JPEG, …
• Programmability
  – StreamIt Language (CC 02)
  – Teleport Messaging (PPOPP 05)
  – Programming Environment in Eclipse (P-PHEC 05)
• Domain Specific Optimizations
  – Linear Analysis and Optimization (PLDI 03)
  – Optimizations for bit streaming (PLDI 05)
  – Linear State Space Analysis (CASES 05)
• Architecture Specific Optimizations
  – Compiling for Communication-Exposed Architectures (ASPLOS 02)
  – Phased Scheduling (LCTES 03)
  – Cache Aware Optimization (LCTES 05)
  – Load-Balanced Rendering (Graphics Hardware 05)
[Figure: compiler infrastructure. Stages: StreamIt Program, Front-end, Stream-Aware Optimizations. Backends: Uniprocessor backend, Cluster backend, Raw backend, IBM X10 backend. Other labels: Annotated Java, Simulator (Java Library), C/C++, MPI-like C/C++, C per tile + msg code, Streaming X10 runtime.]
14
Model of Computation
• Synchronous Dataflow [Lee ’92]
  – Graph of autonomous filters
  – Communicate via FIFO channels
• Static I/O rates
  – Compiler decides on an order of execution (schedule)
  – Static estimation of computation
[Figure: example stream graph with filters A/D, Band Pass, a Duplicate splitter, and four Detect/LED pairs.]
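Because the I/O rates are static, a steady-state schedule can be computed before the program runs. The Python sketch below is an illustration under stated assumptions, not the StreamIt scheduler: it handles only a linear pipeline, the function name `steady_state_repetitions` is hypothetical, and the example rates are made up. It solves the balance equations so that every channel receives exactly as many items as its consumer pops.

```python
from fractions import Fraction
from functools import reduce
from math import gcd
from typing import List, Tuple


def steady_state_repetitions(rates: List[Tuple[int, int]]) -> List[int]:
    """Return the smallest firing counts that balance every channel.

    rates holds one (pop, push) pair per filter of a linear pipeline.
    Balance means reps[i] * push[i] == reps[i + 1] * pop[i + 1].
    """
    reps = [Fraction(1)]
    for (_, push), (pop, _) in zip(rates, rates[1:]):
        # The consumer must fire often enough to drain what the producer pushes.
        reps.append(reps[-1] * push / pop)
    # Scale by the least common multiple of the denominators to get integers.
    scale = reduce(lambda a, b: a * b // gcd(a, b), (r.denominator for r in reps), 1)
    return [int(r * scale) for r in reps]


# Hypothetical pipeline: a source pushing 2, a filter popping 3 and pushing 1,
# and a sink popping 2.
example_reps = steady_state_repetitions([(0, 2), (3, 1), (2, 0)])
```

For the example rates, the source fires 3 times (6 items), the middle filter fires twice (consuming 6, producing 2), and the sink fires once.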
15
Example StreamIt Filter
[Figure: the FIR filter reads a sliding window over input items 0 to 11 and writes output items 0, 1, …]
float->float filter FIR (int N, float[N] weights) {
  work push 1 pop 1 peek N {
    float result = 0;
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    pop();
    push(result);
  }
}
Stateless
16
Example StreamIt Filter
float->float filter FIR (int N) {
  float[N] weights;
  work push 1 pop 1 peek N {
    float result = 0;
    weights = adaptChannel(weights);
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    pop();
    push(result);
  }
}
[Figure: the FIR filter reads a sliding window over input items 0 to 11 and writes output items 0, 1, …]
Stateful
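The difference between the two FIR variants can be simulated in Python. This is a sketch of the peek/pop/push semantics only, not generated StreamIt code; `fir_stateless`, `fir_stateful`, and the `adapt` callback (standing in for `adaptChannel`) are hypothetical names.

```python
from collections import deque
from typing import Callable, List, Sequence


def fir_stateless(weights: Sequence[float], stream: Sequence[float]) -> List[float]:
    """Simulate `work push 1 pop 1 peek N` with fixed weights."""
    n = len(weights)
    channel = deque(stream)
    out = []
    while len(channel) >= n:  # fire only when N items can be peeked
        result = sum(w * channel[i] for i, w in enumerate(weights))  # peek(i)
        channel.popleft()  # pop()
        out.append(result)  # push(result)
    return out


def fir_stateful(
    weights: Sequence[float],
    stream: Sequence[float],
    adapt: Callable[[Sequence[float]], Sequence[float]],
) -> List[float]:
    """Same filter, but the weights are updated on every firing."""
    n = len(weights)
    channel = deque(stream)
    out = []
    while len(channel) >= n:
        weights = adapt(weights)  # state carried from one firing to the next
        out.append(sum(w * channel[i] for i, w in enumerate(weights)))
        channel.popleft()
    return out


stateless_out = fir_stateless([1.0, 1.0], [1.0, 2.0, 3.0, 4.0])
stateful_out = fir_stateful(
    [1.0, 1.0], [1.0, 2.0, 3.0, 4.0], adapt=lambda w: [2.0 * x for x in w]
)
```

In the stateless version each output depends only on the current window, so firings can be reordered or split across cores. In the stateful version each firing depends on the weights left by the previous one, which is what rules out fission.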
17
StreamIt Language Overview
• StreamIt is a novel language for streaming
  – Exposes parallelism and communication
  – Architecture independent
  – Modular and composable
    – Simple structures composed to create complex graphs
  – Malleable
    – Change program behavior with small modifications
[Figure: StreamIt's hierarchical structures: filter, pipeline, splitjoin (splitter, parallel computation, joiner), and feedback loop (joiner, splitter). Each component may be any StreamIt language construct.]
18
Outline
• StreamIt language overview
• Mapping to multicores
  – Baseline techniques
  – Our 3-phase solution
19
Baseline 1: Task Parallelism
[Figure: example stream graph. A Splitter feeds two identical pipelines (BandPass, Compress, Process, Expand, BandStop); a Joiner merges them into an Adder.]
• Inherent task parallelism between two processing pipelines
• Task Parallel Model:
  – Only parallelize explicit task parallelism
  – Fork/join parallelism
• Execute this on a 2 core machine: ~2x speedup over single core
• What about 4, 16, 1024, … cores?
20
Evaluation: Task Parallelism
Raw Microprocessor
– 16 in-order, single-issue cores with D$ and I$
– 16 memory banks, each bank with DMA
– Cycle accurate simulator
[Figure: bar chart of throughput normalized to single-core StreamIt (y-axis 0 to 19) for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean.]
Parallelism: Not matched to target!
Synchronization: Not matched to target!
21
Baseline 2: Fine-Grained Data Parallelism
• Each of the filters in the example is stateless
• Fine-grained Data Parallel Model:
  – Fiss each stateless filter N ways (N is number of cores)
  – Remove scatter/gather if possible
• We can introduce data parallelism
  – Example: 4 cores
• Each fission group occupies entire machine
[Figure: the example graph after fine-grained fission for 4 cores. Each filter (BandPass, Compress, Process, Expand, BandStop, Adder) is replicated inside its own Splitter/Joiner pair.]
22
Evaluation: Fine-Grained Data Parallelism
[Figure: bar chart of throughput normalized to single-core StreamIt (y-axis 0 to 19), comparing Task and Fine-Grained Data for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean.]
Good Parallelism!
Too Much Synchronization!
23
Outline
• StreamIt language overview
• Mapping to multicores
  – Baseline techniques
  – Our 3-phase solution
24
Phase 1: Coarsen the Stream Graph
• Before data-parallelism is exploited
• Fuse stateless pipelines as much as possible without introducing state
  – Don’t fuse stateless with stateful
  – Don’t fuse a peeking filter with anything upstream
[Figure: the original graph. A Splitter feeds two pipelines of BandPass, Compress, Process, Expand, BandStop, followed by a Joiner and an Adder; four filters are marked "Peek".]
25
Phase 1: Coarsen the Stream Graph
• Before data-parallelism is exploited
• Fuse stateless pipelines as much as possible without introducing state
  – Don’t fuse stateless with stateful
  – Don’t fuse a peeking filter with anything upstream
• Benefits:
  – Reduces global communication and synchronization
  – Exposes inter-node optimization opportunities
[Figure: the coarsened graph. In each pipeline, BandPass, Compress, Process, and Expand are fused into one filter, followed by BandStop; a Joiner and an Adder follow.]
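The two fusion rules can be sketched as a single pass over one pipeline. This Python sketch is a simplification, not the compiler's coarsening pass: the `Filter` record and `coarsen` function are hypothetical, and marking BandPass and BandStop as the peeking filters is an assumption inferred from the fused graph on the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Filter:
    name: str
    stateful: bool = False
    peeks: bool = False  # True if the filter reads more items than it pops


def coarsen(pipeline: List[Filter]) -> List[List[str]]:
    """Group adjacent filters of one pipeline into fusion groups."""
    groups: List[List[Filter]] = []
    for f in pipeline:
        can_extend = (
            bool(groups)
            and not f.stateful  # don't fuse stateless with stateful
            and not groups[-1][-1].stateful
            and not f.peeks  # don't fuse a peeking filter with anything upstream
        )
        if can_extend:
            groups[-1].append(f)
        else:
            groups.append([f])
    return [[f.name for f in group] for group in groups]


# One of the two pipelines from the example (peeking filters are an assumption).
branch = [
    Filter("BandPass", peeks=True),
    Filter("Compress"),
    Filter("Process"),
    Filter("Expand"),
    Filter("BandStop", peeks=True),
]
fused = coarsen(branch)
```

A peeking filter may start a group, since nothing upstream is fused into it, but it may not join an existing one. Under the stated assumption this reproduces the slide's result: one fused BandPass/Compress/Process/Expand filter followed by a separate BandStop.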
26
Phase 2: Data Parallelize
Data Parallelize for 4 cores
– Fiss 4 ways, to occupy entire chip
[Figure: the coarsened graph with the Adder replicated 4 ways inside a Splitter/Joiner pair.]
27
Phase 2: Data Parallelize
Data Parallelize for 4 cores
– Task parallelism! Each fused filter does equal work
– Fiss each filter 2 times to occupy entire chip
[Figure: each fused BandPassCompressProcessExpand filter is replicated 2 ways inside its own Splitter/Joiner pair; the two BandStop filters are not yet fissed; the Adder is replicated 4 ways.]
28
Phase 2: Data Parallelize
Data Parallelize for 4 cores
– Task parallelism, each filter does equal work
– Fiss each filter 2 times to occupy entire chip
• Task-conscious data parallelization
  – Preserve task parallelism
• Benefits:
  – Reduces global communication and synchronization
[Figure: both fused BandPassCompressProcessExpand filters and both BandStop filters are replicated 2 ways, each inside its own Splitter/Joiner pair; the Adder is replicated 4 ways.]
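The fission factors on these slides (4 for the Adder, 2 for each of the two parallel branches) follow from dividing the cores among filters that already run side by side. The Python sketch below captures only that arithmetic; `fission_factor` is a hypothetical helper, and it assumes the task-parallel peers do equal work, as the slide states for this example. The compiler's actual heuristic is not given in the slides.

```python
def fission_factor(cores: int, task_parallel_peers: int) -> int:
    """Ways to fiss one stateless filter.

    task_parallel_peers counts the filters (including this one) that do equal
    work and can already execute in parallel with each other.
    """
    # Existing task parallelism already fills some cores, so replicate less.
    return max(1, cores // task_parallel_peers)


adder_ways = fission_factor(cores=4, task_parallel_peers=1)
branch_ways = fission_factor(cores=4, task_parallel_peers=2)
```

Replicating each branch filter 4 ways instead would create 8 copies for 4 cores, adding scatter/gather traffic without adding usable parallelism.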
29
Evaluation: Coarse-Grained Data Parallelism
[Figure: bar chart of throughput normalized to single-core StreamIt (y-axis 0 to 19), comparing Task, Fine-Grained Data, and Coarse-Grained Task + Data for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean.]
Good Parallelism!
Low Synchronization!
30
Simplified Vocoder
Target a 4 core machine
[Figure: simplified vocoder stream graph with work estimates: a splitjoin of two AdaptDFT filters (6 each), RectPolar (20), a splitjoin whose two branches each contain Amplify (2), Diff (1), UnWrap (1), and Accum (1), and PolarRect (20). Annotations: "Data Parallel" (twice) and "Data Parallel, but too little work!".]
31
Data Parallelize
Target a 4 core machine
[Figure: the vocoder graph after data parallelization. RectPolar and PolarRect are each replicated 4 ways inside Splitter/Joiner pairs, so their work estimates drop from 20 to 5 per copy; the AdaptDFT, Amplify, Diff, UnWrap, and Accum estimates are unchanged.]
32
Data + Task Parallel Execution
Target 4 core machine
[Figure: execution schedule of the data- and task-parallel vocoder graph, with cores on one axis and time on the other; the time axis is marked "21".]
33
We Can Do Better!
Target 4 core machine
[Figure: the same vocoder schedule on 4 cores, with cores on one axis and time on the other; the time axis is marked "16".]
34
Phase 3: Coarse-Grained Software Pipelining
• New steady-state is free of dependencies
• Schedule new steady-state using a greedy partitioning
[Figure: execution timeline showing a Prologue followed by the New Steady State; the four RectPolar copies are labeled.]
35
Greedy Partitioning
Target 4 core machine
[Figure: the filters listed under "To Schedule:" are assigned to the 4 cores; the time axis is marked "16".]
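Once software pipelining has removed the dependencies inside the steady state, scheduling reduces to packing independent work items onto cores. The Python sketch below uses a largest-first greedy rule, which is an assumption; the slides say only "greedy partitioning". The work estimates are read from the vocoder slides, with the extracted "66" interpreted as two AdaptDFT filters of 6 each, and `greedy_partition` is a hypothetical name.

```python
from typing import List, Tuple


def greedy_partition(work: List[Tuple[str, int]], cores: int):
    """Assign each (name, cost) item to the least-loaded core, largest first."""
    loads = [0] * cores
    assignment: List[List[str]] = [[] for _ in range(cores)]
    for name, cost in sorted(work, key=lambda item: item[1], reverse=True):
        core = min(range(cores), key=loads.__getitem__)  # least-loaded core
        loads[core] += cost
        assignment[core].append(name)
    return loads, assignment


# Vocoder after data parallelization (estimates as read from the slides).
vocoder = (
    [("AdaptDFT", 6)] * 2
    + [("RectPolar", 5)] * 4
    + [("Amplify", 2), ("Diff", 1), ("UnWrap", 1), ("Accum", 1)] * 2
    + [("PolarRect", 5)] * 4
)
loads, assignment = greedy_partition(vocoder, cores=4)
```

The total work is 62 units, so no 4-core schedule can finish a steady state in fewer than 16 units. The greedy packing reaches that bound, which is consistent with the "16" marked on the slide, compared with 21 for the dependency-constrained data + task schedule.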
36
Evaluation: Coarse-Grained Task + Data + Software Pipelining
[Figure: bar chart of throughput normalized to single-core StreamIt (y-axis 0 to 19), comparing Task, Fine-Grained Data, Coarse-Grained Task + Data, and Coarse-Grained Task + Data + Software Pipeline for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean.]
Best Parallelism!
Lowest Synchronization!
37
Generalizing to Other Multicores
• Architectural requirements:
  – Compiler controlled local memories with DMA
  – Efficient implementation of scatter/gather
• To port to other architectures, consider:
  – Local memory capacities
  – Communication to computation tradeoff
• Did not use processor-to-processor communication on Raw
38
Related Work
• Streaming languages:
  – Brook [Buck et al. ’04]
  – StreamC/KernelC [Kapasi ’03, Das et al. ’06]
  – Cg [Mark et al. ’03]
  – SPUR [Zhang et al. ’05]
• Streaming for Multicores:
  – Brook [Liao et al. ’06]
• Ptolemy [Lee ’95]
• Explicit parallelism:
  – OpenMP, MPI, & HPF
39
Conclusions
• Streaming model naturally exposes task, data, and pipeline parallelism
• This parallelism must be exploited at the correct granularity and combined correctly

Technique                                         Parallelism    Synchronization
Task                                              Not matched    Not matched
Fine-Grained Data                                 Good           High
Coarse-Grained Task + Data                        Good           Low
Coarse-Grained Task + Data + Software Pipeline    Best           Lowest

• Good speedups across varied benchmark suite
• Algorithms should be applicable across multicores