Profile Guided Deployment of Stream Programs on Multicores
S. M. Farhad
The University of Sydney
Joint work with
Yousun Ko
Bernd Burgstaller
Bernhard Scholz
Outline
- Motivation
  - Multicore trend
  - Stream programming
- Research Questions
  - How to profile communication overhead on multicores?
  - How to deploy stream programs?
- Related works
Motivation
[Figure: number of cores per chip (1 to 512) by year (1975 to 2010). Processors plotted include the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium2, Power4, PA8800, Opteron, CoreDuo, Core2Duo, Core2Quad, Power6, Xbox 360, BCM 1480, Opteron 4P, Xeon, Niagara, Cell, RAW, RAZA XLR, Cavium, CISCO CSR1, Larrabee, PicoChip, AMBRIC, AMD Fusion and NVIDIA G80. Legend: Unicore, Homogeneous Multicore, Heterogeneous Multicore. Courtesy: Scott ’08]
[Figure: programming languages and models: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, Stream Programming]
Stream Programming Paradigm
- Programs expressed as stream graphs
- Streams: infinite sequences of data elements
- Actors: functions applied to streams
[Figure: an actor with an input stream and an output stream]
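The stream and actor notions above can be illustrated with Python generators. This is only a minimal sketch, not StreamIt code: the names `source`, `scale` and `moving_sum` are hypothetical actors chosen for illustration.

```python
from itertools import count, islice


def source():
    """A stream: the unbounded sequence 0, 1, 2, ..."""
    return count()


def scale(stream, factor):
    """Actor: multiplies every element of its input stream by `factor`."""
    for x in stream:
        yield x * factor


def moving_sum(stream, window):
    """Actor: emits the sum of the last `window` elements of its input."""
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) == window:
            yield sum(buf)
            buf.pop(0)  # slide the window by one element


# Connect the actors into a small stream graph and take a finite prefix.
first_outputs = list(islice(moving_sum(scale(source(), 2), 3), 4))
print(first_outputs)
```

Each actor only sees its input and output streams, which is what makes the producer/consumer dependencies explicit.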
Properties of Stream Programs
- Regular and repeating computation
- Independent actors with explicit communication
- Producer/consumer dependencies
[Figure: stream graph with actors AtoD, FMDemod, Splitter, LPF1 to LPF3, HPF1 to HPF3, Joiner, Adder and Speaker]
StreamIt Language
- An implementation of stream programming
- Hierarchical structure
- Each construct has a single input/output stream
[Figure: the StreamIt constructs filter, pipeline, splitjoin (splitter, parallel computation, joiner) and feedback loop (joiner, splitter); each nested component may be any StreamIt language construct]
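The hierarchical composition of filters, pipelines and splitjoins can be sketched in Python as follows. This is an illustration under assumptions, not StreamIt syntax: `make_filter`, `pipeline` and `splitjoin` are hypothetical helper names, the splitter is assumed to duplicate its input, the joiner is assumed to be round-robin, and the feedback loop is left out.

```python
from itertools import count, islice, tee


def make_filter(fn):
    """Filter: applies `fn` to every element of its single input stream."""
    def run(stream):
        return map(fn, stream)
    return run


def pipeline(*stages):
    """Pipeline: feeds the output stream of each stage into the next one."""
    def run(stream):
        for stage in stages:
            stream = stage(stream)
        return stream
    return run


def splitjoin(*branches):
    """Splitjoin: duplicate splitter, parallel branches, round-robin joiner."""
    def run(stream):
        copies = tee(stream, len(branches))
        outputs = [branch(copy) for branch, copy in zip(branches, copies)]
        for items in zip(*outputs):
            yield from items  # one element from each branch in turn
    return run


# Every construct has one input and one output stream, so constructs nest.
program = pipeline(
    make_filter(lambda x: x + 1),
    splitjoin(make_filter(lambda x: x * 10), make_filter(lambda x: -x)),
)
result = list(islice(program(count()), 6))
print(result)
```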
How to Estimate the Communication Overhead on Multicores?
Problems in Measuring Communication Overhead on Multicores
- Reasons:
  - Multicores are non-communication-exposed architectures
  - Complex cache hierarchy
  - Cache coherence protocols
- Consequence:
  - Cannot directly measure the communication cost
  - Estimate the communication cost by measuring the execution time of actors
Measuring the Communication Overhead of an Edge
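[Figure: actors i and k connected by an edge. With both actors on Processor 1 there is no communication cost and the execution times are t_i and t_k. With the actors split across Processor 1 and Processor 2 there is a communication cost and the execution times are t'_i and t'_k.]
$C_{(i,k)} = t'_i - t_i + t'_k - t_k$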
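The edge cost estimate can be written as a small function. This is a minimal sketch: the function name and the timing values are hypothetical, and it assumes the cost of an edge is the increase in execution time of its two actors when they are split across processors.

```python
def edge_communication_cost(t_same, t_split, i, k):
    """Estimate the communication cost of edge (i, k).

    t_same:  execution time of each actor with both actors on one processor
    t_split: execution time of each actor with the actors on different processors
    """
    return (t_split[i] - t_same[i]) + (t_split[k] - t_same[k])


# Hypothetical timings in microseconds, for illustration only.
t_same = {"i": 10.0, "k": 20.0}
t_split = {"i": 12.5, "k": 23.0}
cost = edge_communication_cost(t_same, t_split, "i", "k")
print(cost)
```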
How to Minimize the Required Number of Experiments
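[Figure: pipeline A, B, C with edges 1 and 2. Graph coloring requires 2+1 steps.]
[Figure: pipeline with actors A to F and edges 1 to 5 mapped to Processor 1 and Processor 2, with the even edges across the partition]
[Figure: the pipeline mapped to Processor 1 and Processor 2, with the odd edges across the partition]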
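One way to read the 2+1 steps for a pipeline is one baseline run on a single processor plus one run per edge color, where edges are colored by parity. The sketch below follows that reading; the function names are hypothetical and the baseline step is an assumption based on the "no communication cost" measurement.

```python
def pipeline_profiling_steps(actors):
    """Return one actor-to-processor mapping per profiling step of a pipeline.

    Step 0 keeps every actor on processor 1 (assumed baseline).
    Step 1 puts the even-numbered edges across the partition.
    Step 2 puts the odd-numbered edges across the partition.
    Edge e (1-based) connects actors[e - 1] and actors[e].
    """
    steps = [{a: 1 for a in actors}]
    for parity in (0, 1):
        proc = 1
        mapping = {actors[0]: proc}
        for e in range(1, len(actors)):
            if e % 2 == parity:
                proc = 3 - proc  # switch processor so that edge e crosses
            mapping[actors[e]] = proc
        steps.append(mapping)
    return steps


def crossing_edges(actors, mapping):
    """List the 1-based edge numbers whose endpoints sit on different processors."""
    return [e for e in range(1, len(actors))
            if mapping[actors[e - 1]] != mapping[actors[e]]]


actors = list("ABCDEF")
steps = pipeline_profiling_steps(actors)
for step in steps:
    print(crossing_edges(actors, step))
```

The number of steps stays at three however long the pipeline is, which is the point of grouping edges by color.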
Obs. 1: There is no loop of three actors in a stream graph
[Figure: three actors i, k and l placed on Processor 1 and Processor 2]
Obs. 2: There is no interference of adjacent nodes between edges
[Figure: stream graph with actors A to F mapped to processors P-1 to P-4 for the blue-colored edges]
Remove Interference
Convert to a line graph
Add interference edges
Use a vertex coloring algorithm
[Figure: the stream graph with actors A to F and edges AB, BC, BD, CE, DE and EF, next to its line graph with nodes AB, BC, BD, CE, DE and EF]
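The three steps (line graph, interference edges, vertex coloring) can be sketched as below. The slides do not define interference precisely, so the rule used here is an assumption: two edges interfere when an endpoint of one is adjacent to an endpoint of the other. The greedy coloring is a stand-in for whichever vertex coloring algorithm was actually used, and `color_edges` is a hypothetical name.

```python
from itertools import combinations


def color_edges(edges):
    """Color stream-graph edges so that conflicting edges get different colors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def conflict(e, f):
        if set(e) & set(f):
            return True  # line-graph edge: e and f share an actor
        # Assumed interference rule: some endpoint of e neighbors an endpoint of f.
        return any(b in adj[a] for a in e for b in f)

    conflicts = {e: set() for e in edges}
    for e, f in combinations(edges, 2):
        if conflict(e, f):
            conflicts[e].add(f)
            conflicts[f].add(e)

    # Greedy vertex coloring of the line graph with interference edges.
    colors = {}
    for e in edges:
        used = {colors[f] for f in conflicts[e] if f in colors}
        colors[e] = next(c for c in range(len(edges)) if c not in used)
    return colors


edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
colors = color_edges(edges)
print(colors)
```

Under this assumed rule, AB and EF end up in the same color class, which matches the pair of edges measured together for the blue color in the slides.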
Processor Leveling Graph
[Figure: for the blue-colored edges, the stream graph with actors A to F collapses into the processor leveling graph with nodes A; B, C, D, E; F]
Coloring the Processor Leveling Graph
[Figure: the processor leveling graph (A; B, C, D, E; F) colored with Processor 1 and Processor 2]
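A possible reading of the leveling graph is that all edges outside the chosen color class are contracted, and the resulting graph is then colored with processors. The sketch below follows that reading with a union-find contraction and a two-processor breadth-first coloring; both function names are hypothetical and the two-processor limit is an assumption.

```python
from collections import deque


def leveling_graph(actors, edges, selected):
    """Merge actors joined by non-selected edges; keep selected edges between groups."""
    parent = {a: a for a in actors}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for u, v in edges:
        if (u, v) not in selected:
            parent[find(u)] = find(v)

    groups = {}
    for a in actors:
        groups.setdefault(find(a), []).append(a)
    level_edges = {(find(u), find(v)) for u, v in selected}
    return groups, level_edges


def assign_processors(groups, level_edges):
    """Two-color the leveling graph so every selected edge crosses processors."""
    adj = {g: set() for g in groups}
    for u, v in level_edges:
        if u == v:
            raise ValueError("a selected edge collapsed into a single group")
        adj[u].add(v)
        adj[v].add(u)
    proc = {}
    for start in groups:
        if start in proc:
            continue
        proc[start] = 1
        queue = deque([start])
        while queue:
            g = queue.popleft()
            for h in adj[g]:
                if h not in proc:
                    proc[h] = 3 - proc[g]
                    queue.append(h)
                elif proc[h] == proc[g]:
                    raise ValueError("two processors are not enough")
    return {a: proc[g] for g, members in groups.items() for a in members}


actors = list("ABCDEF")
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
blue = {("A", "B"), ("E", "F")}
groups, level_edges = leveling_graph(actors, edges, blue)
mapping = assign_processors(groups, level_edges)
print(sorted(groups.values()), mapping)
```

For the blue edges AB and EF this reproduces the three groups A; B, C, D, E; F, with the middle group on a different processor from A and F.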
Measuring the Communication Cost
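[Figure: for the blue-colored edges, the stream graph and its processor leveling graph (A; B, C, D, E; F) mapped to Processor 1 and Processor 2, with measured execution times t_A, t_B, t_E and t_F]
$C_{(A,B)} = (t'_A - t_A) + (t'_B - t_B)$
$C_{(E,F)} = (t'_E - t_E) + (t'_F - t_F)$
Here t'_x is the execution time of actor x under this mapping, and t_x is its execution time without communication cost.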
Profiling Performance
Benchmark    Total Edges   Profiling Steps   Steps/Edges (%)   Error (%)
SAR          44            3                 7                 10
MatrixMult   88            21                24                17
MergeSort    37            4                 11                31
FMRadio      21            3                 14                24
DCT          28            9                 32                14
RadixSort    12            2                 17                5
FFT          26            3                 12                27
MPEG         56            17                30                15
Channel      22            6                 27                11
BeamFormer   39            5                 13                13
GM                                           17                15
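The derived columns of the table can be rechecked with a few lines of Python. The sketch assumes that GM denotes the geometric mean, which is what the reported 17% and 15% are consistent with; the variable names are illustrative.

```python
from statistics import geometric_mean

# (total edges, profiling steps, error %) per benchmark, copied from the table.
results = {
    "SAR": (44, 3, 10), "MatrixMult": (88, 21, 17), "MergeSort": (37, 4, 31),
    "FMRadio": (21, 3, 24), "DCT": (28, 9, 14), "RadixSort": (12, 2, 5),
    "FFT": (26, 3, 27), "MPEG": (56, 17, 15), "Channel": (22, 6, 11),
    "BeamFormer": (39, 5, 13),
}

# Steps/Edges (%) is the number of profiling steps relative to the edge count.
steps_per_edge = {name: 100 * steps / edges
                  for name, (edges, steps, _) in results.items()}

gm_steps = geometric_mean(list(steps_per_edge.values()))
gm_error = geometric_mean([err for _, _, err in results.values()])
print(round(gm_steps), round(gm_error))
```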
Deployment of Stream Programs
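[Figure: stream graph with actors A (5), B (40), C (40) and D (5), where the numbers in parentheses are execution times and the edges carry communication costs of 5 and 25, mapped to Processor 1 and Processor 2]
Load = (5 + 40) + 5 = 50
Load = (40 + 5) + 5 = 50
Makespan = 50, Speedup = 90/50 = 1.8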
Deploying Stream Programs without Considering Communication
[Figure: the same stream graph with A (5) and C (40) on one processor and B (40) and D (5) on the other; the edges carry communication costs of 5 and 25]
Load = (5 + 40) + (25 + 5 + 25) = 100
Load = (40 + 5) + (25 + 5 + 25) = 100
Makespan = 100, Speedup = 90/100 = 0.9
Increase in makespan = (100 − 50) × 100% / 50 = 100%
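The load, makespan and speedup arithmetic of the two deployments can be replayed as follows. This is a sketch only: the per-processor communication totals (5 and 25 + 5 + 25) are taken directly from the load formulas, since the edge structure of the graph is not recoverable here, and the function name is hypothetical.

```python
work = {"A": 5, "B": 40, "C": 40, "D": 5}  # execution time of each actor


def makespan(partition, comm):
    """Largest processor load: actor execution times plus communication cost."""
    loads = [sum(work[a] for a in actors) + c
             for actors, c in zip(partition, comm)]
    return max(loads)


sequential = sum(work.values())  # 90: all actors on one processor

# Deployment that takes communication into account.
aware = makespan([("A", "B"), ("C", "D")], [5, 5])

# Deployment that ignores communication when balancing the load.
oblivious = makespan([("A", "C"), ("B", "D")], [25 + 5 + 25, 25 + 5 + 25])

speedup_aware = sequential / aware
speedup_oblivious = sequential / oblivious
increase_pct = (oblivious - aware) * 100 / aware
print(aware, oblivious, speedup_aware, speedup_oblivious, increase_pct)
```

Both deployments balance the computation equally (45 per processor); only the communication term separates a speedup of 1.8 from a slowdown.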
Deployment Performance
m: makespan when communication is considered; m': makespan when communication is not considered
Benchmark    m (μs)   m' (μs)   (m' − m)/m (%)
SAR          45.54    45.54     0
MatrixMult   67.80    111.14    64
MergeSort    1.63     6.99      329
FMRadio      1.57     7.00      346
DCT          4.64     7.68      66
RadixSort    1.49     3.08      107
FFT          18.28    34.15     87
MPEG         37.26    37.26     0
Channel      89.00    91.20     2
BeamFormer   7.29     7.29      0
Speedups obtained for 2, 4 and 6 processors
Summary
- We propose an efficient profiling technique for multicores that minimizes the number of profiling steps
- We propose an ILP-based approach that minimizes the makespan
- We conducted experiments:
  - The number of profiling steps is on average only 17% of the number of edges
  - The profiling scheme shows only 15% error on average in the random mapping test
  - Obtains a speedup of 3.11x for 4 processors and a speedup of 4.02x for 6 processors
Related Works
[1] Static Scheduling of SDF Programs for DSP [Lee ’87]
[2] StreamIt: A language for streaming applications [Thies ’02]
[3] Phased Scheduling of Stream Programs [Thies ’03]
[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies ’06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott ’08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa ’09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa ’09]
[8] Orchestration by approximation [Farhad ’11]
Questions?
Minimizing Errors in Profiling Process
- Errors are likely in any profiling process
- We chose an architecture which has a uniform cache hierarchy
- We pin the threads using the likwid-pin tool
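Pinning keeps a measured actor on one core so that timings are not disturbed by migration. The slides use likwid-pin for this; the sketch below is a stand-in that uses Python's `os.sched_setaffinity`, which is available on Linux only, and `pin_to_core` is a hypothetical helper name.

```python
import os


def pin_to_core(core):
    """Restrict the calling process to a single core and return its new affinity."""
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity is not available on this platform")
    os.sched_setaffinity(0, {core})  # pid 0 means the calling process
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)   # cores this process may currently use
    print(pin_to_core(min(allowed)))    # pin to the lowest allowed core
    os.sched_setaffinity(0, allowed)    # restore the original affinity
```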
Cache Topology of Processor
Core #0     Core #1     Core #2     Core #3     Core #4     Core #5
L1: 64kB    L1: 64kB    L1: 64kB    L1: 64kB    L1: 64kB    L1: 64kB
L2: 512kB   L2: 512kB   L2: 512kB   L2: 512kB   L2: 512kB   L2: 512kB
L3: 6MB (shared by all cores)
800MHz hexa-core AMD Phenom(tm) II X6 1090T