Flexible Filters for High Performance Embedded Computing
Rebecca Collins and Luca Carloni
Department of Computer Science
Columbia University
Motivation
Stream Programming
[Diagram: three filters A, B, C connected in sequence by channels]
• Stream Programming model
  – filter: a piece of sequential programming
  – channels: how filters communicate
  – token: an indivisible unit of data for a filter
• Examples: Signal processing, image processing, embedded applications
Example: Constant False-Alarm Rate (CFAR) Detection
[Figure: a row of cells with the cell under test flanked by guard cells]
Compare the cell under test to the surrounding cells, with additional factors:
• number of gates
• threshold (μ)
• number of guard cells
• rows and other dimensional data
HPEC Challenge: http://www.ll.mit.edu/HPECchallenge/
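The comparison on this slide can be sketched in a few lines of Python. This is a minimal cell-averaging CFAR under stated assumptions, not the HPEC reference kernel: the function name `cfar_detect`, the parameter names, and the exact decision rule (cell power divided by the mean power of the window cells, compared against μ) are illustrative choices.

```python
def cfar_detect(samples, num_window, num_guard, mu):
    """Return indices of cells flagged as targets (illustrative sketch)."""
    # Convert to float and square, as in the first two stages of the filter graph.
    power = [float(s) * float(s) for s in samples]
    targets = []
    for i in range(len(power)):
        # Window cells lie beyond the guard cells on each side of the cell under test.
        left = power[max(0, i - num_guard - num_window): max(0, i - num_guard)]
        right = power[i + num_guard + 1: i + num_guard + 1 + num_window]
        neighbors = left + right
        if not neighbors:
            continue
        noise = sum(neighbors) / len(neighbors)
        # Assumed decision rule: normalized power above the threshold mu.
        if noise > 0 and power[i] / noise > mu:
            targets.append(i)
    return targets


samples = [1] * 15
samples[7] = 10  # one strong return in a flat background
print(cfar_detect(samples, num_window=3, num_guard=1, mu=5.0))  # [7]
```

The extra branch taken for each detected target is what makes the execution time of the last stage depend on the data.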
CFAR Benchmark
[Diagram: filter graph with the stages uInt to float, square, right window, left window, add, align data, find targets]
Data Stream:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[Animation: tokens 1 through 15 of the data stream advance through the CFAR filter graph; labels mark the cell under test, the left window, and the right window]
Mapping
[Diagram: filters A, B, C mapped onto the cores of a multi-core chip, first one filter per core (core 1, core 2, core 3), then onto core 1 and core 3 only]
Throughput: rate of processing data tokens
Unbalanced Flow
[Diagram: filters A, B, C on core 1, core 2, core 3; in the final frames a core is marked "wait"]
• Bottlenecks reduce throughput
  – Caused by backpressure
    • inherent algorithmic imbalances
    • data-dependent computational spikes
Data Dependent Execution Time: CFAR
• Using Set 1 from the HPEC CFAR Kernel Benchmark
• Over a block of about 100 cells
• Extra workload of 32 microseconds per target
[Figure: three histograms of Percent versus Execution Time (microseconds), one each for 0.3 % targets, 1.3 % targets, and 7.3 % targets]
Data Dependent Execution Time
Some other examples:
• Bloom Filters (Financial, Spam detection)
• Compression (Image processing)
[Figure: histogram of Percent versus Execution Time (s), x-axis from 0.000 to 0.005]
Our Solution: Flexible Filters
[Diagram: filters A, B, C on core 1, core 2, core 3, with a second copy of C placed between a flex-split and a flex-merge]
• Unused cycles on B are filled by working ahead on filter C with the data already present on B
• Push stream flow upstream of a bottleneck
• Semantic Preservation
Related Works
• Static Compiler Optimizations
  – StreamIt [Gordon et al., 2006]
• Dynamic Runtime Load Balancing
  – Work Stealing/Filter Migration
    • [Kakulavarapu et al., 2001]
    • Cilk [Frigo et al., 1998]
    • Flux [Shah et al., 2003]
    • Borealis [Xing et al., 2005]
  – Queue based load balancing
    • Diamond [Huston et al., 2005] distributed search, queue based load balancing, filter re-ordering
• Combination Static+Dynamic
  – FlexStream [Hormati et al., 2009] multiple competing programs
Outline
• Introduction
• Design Flow of a Stream Program with Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
  – CFAR Case Study
Design Flow of a Stream Program with Flexible Filters
[Diagram: design flow supported by compiler/design tools: Design stream algorithm → Mapping (Filters, Memory) → Profile → Add Flexibility to Bottlenecks]
[Figure: Profile of CFAR filters on Cell; bar chart of Average Execution Time (microseconds) per 100 tokens, axis 0 to 5, with bars for right window, uIntToFloat, find targets, add]
Design Considerations
[Diagram: design flow: Design stream algorithm → Mapping (Filters, Memory) → Profile → Add Flexibility to Bottlenecks]
Adding Flexibility
[Diagram: design flow: Design stream algorithm → Mapping (Filters, Memory) → Profile → Add Flexibility to Bottlenecks]
Mapping Stream Programs to Multi-Core Platforms
[Diagram: pipeline mapping: filters A, B, C each placed on its own core (core 1, core 2, core 3)]
[Diagram: SPMD mapping: core 1, core 2, and core 3 each run their own copy of A, B, C, with the filters sharing a core]
Throughput: SPMD Mapping
Suppose E_A = 2, E_B = 2, E_C = 3
[Diagram: SPMD mapping: core 1, core 2, and core 3 each run A, B, C on their own token (1, 2, 3)]
[Timeline t0 to t6: core 1 runs A1, B1, C1; core 2 runs A2, B2, C2; core 3 runs A3, B3, C3]
3 tokens processed in 7 timesteps, ideal throughput = 3/7 = 0.429
Throughput: Pipeline Mapping
[Diagram: A on core 1 (latency 2), B on core 2 (latency 2), C on core 3 (latency 3)]
[Timeline t0 to t9: core 1 runs A1, A2, A3, A4; core 2 runs B1, B2, B3; core 3 runs C1, C2]
throughput = 1/3 = 0.333 < 0.429
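The two ideal throughput figures can be reproduced with a short calculation. This is a sketch under the slides' idealized assumptions (no communication cost, steady state, latencies E_A = 2, E_B = 2, E_C = 3); the function names are illustrative and not taken from the talk.

```python
def spmd_throughput(latencies, num_cores):
    """Every core runs all filters back to back on its own token."""
    return num_cores / sum(latencies)


def pipeline_throughput(latencies):
    """One filter per core: the slowest stage sets the steady-state rate."""
    return 1 / max(latencies)


latencies = [2, 2, 3]  # E_A, E_B, E_C from the slides
print(round(spmd_throughput(latencies, num_cores=3), 3))  # 0.429
print(round(pipeline_throughput(latencies), 3))           # 0.333
```

In the pipeline mapping, core 3 is the bottleneck, so core 1 and core 2 spend part of every period idle.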
Data Blocks
[Diagram: filters A, B, C on core 1, core 2, core 3 exchanging data blocks]
data block: group of data tokens
Throughput: Pipeline Augmented with Flexibility
[Diagram: A on core 1, B on core 2, C on core 3, with an additional copy of C]
[Timeline t0 to t9: core 1 runs A1, A2, A3, A4; core 2 runs B1, B2, B3; core 3 runs C1, C2; C2 also appears on a second row, shared with the extra copy of C]
throughput = 2/5 = 0.4 < 0.429 (but > 0.333)
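One way to arrive at the 2/5 figure is to treat the work of B and C as shared between core 2 and core 3. The sketch below is an idealized model, not a formula from the talk: it assumes the work of C can be divided in any proportion between the two cores (which is what data blocks make possible), that B stays entirely on core 2, and that splitting and merging cost nothing. The function name is hypothetical.

```python
from fractions import Fraction


def flexible_pipeline_throughput(e_a, e_b, e_c):
    """Ideal rate for A -> B -> C when core 2 also runs a copy of C."""
    return min(
        Fraction(1, e_a),        # core 1 runs only A
        Fraction(1, e_b),        # core 2 must still run all of B
        Fraction(2, e_b + e_c),  # B and C work spread over two cores
    )


rate = flexible_pipeline_throughput(2, 2, 3)
print(rate, float(rate))  # 2/5 0.4
```

With E_B = 2 and E_C = 3, the two cores together have 5 units of work per token, so at best they finish 2 tokens every 5 timesteps.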
Flex-Split
flex-split:
  pop data block b from in
  n0 = available space on out0
  n1 = |b| - n0
  send n0 to out0, n1 to out1
  send n0 0's, then n1 1's to select
[Diagram: on core 2 and core 3, B feeds flex split (channel in; channels out0, out1, select); the two copies of C feed flex merge (channels in0, in1, select; channel out)]
maintain ordering based on run-time state of queues
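The flex-split steps can be sketched in Python as follows. The queue representation (`collections.deque`), the explicit `out0_capacity` parameter, and the clamping of n0 to the block size are assumptions added for illustration; the slide only states that n0 is the available space on out0.

```python
from collections import deque


def flex_split(block, out0, out0_capacity, out1, select):
    """Route one data block: fill the free space on out0, overflow to out1."""
    free = max(0, out0_capacity - len(out0))
    n0 = min(len(block), free)  # assumption: clamp so n1 is never negative
    n1 = len(block) - n0
    out0.extend(block[:n0])
    out1.extend(block[n0:])
    # Record, in order, which channel received each token.
    select.extend([0] * n0 + [1] * n1)
    return n0, n1


out0, out1, select = deque(["x"]), deque(), deque()
print(flex_split([1, 2, 3, 4, 5], out0, 3, out1, select))  # (2, 3)
print(list(select))  # [0, 0, 1, 1, 1]
```

Because the split point depends only on how full out0 is at that moment, the division of work follows the run-time state of the queues rather than a fixed ratio.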
Flex-Merge
flex-merge:
  pop i from select
  if i is 0, pop token from in0
  if i is 1, pop token from in1
  push token to out
[Diagram: on core 2 and core 3, B feeds flex split (channel in; channels out0, out1, select); the two copies of C feed flex merge (channels in0, in1, select; channel out)]
Overhead of Flexibility?
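A matching sketch of flex-merge shows how the select channel restores the original token order. As an assumption for illustration, this version peeks at select and stops when the named input is still empty, instead of blocking as a real channel read would; the token names are placeholders.

```python
from collections import deque


def flex_merge(select, in0, in1, out):
    """Drain select, taking each token from the channel that select names."""
    while select:
        source = in0 if select[0] == 0 else in1
        if not source:
            break  # token not produced yet; a real filter would wait here
        select.popleft()
        out.append(source.popleft())


select = deque([0, 0, 1, 1, 1, 0])
in0 = deque(["c1", "c2", "c6"])  # results from one copy of C
in1 = deque(["c3", "c4", "c5"])  # results from the other copy of C
out = []
flex_merge(select, in0, in1, out)
print(out)  # ['c1', 'c2', 'c3', 'c4', 'c5', 'c6']
```

The output order matches the order in which flex-split wrote to select, regardless of which copy of C finished first.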
Multi-Channel Flex-Split and Flex-Merge
[Diagram: a filter with output channels 1, 2, …, n. In the flexible version, a flex split feeds the filter and its copy filter_flex, each output channel gets its own flex merge, and a select channel links the flex split to the flex merges]
[Diagram: a filter with input channels 1, 2, …, n. Centralized: one flex split with a select channel and one flex merge serve the filter and filter_flex. Distributed: each input channel has its own flex split (input channels 2 through n are labeled β flex split), together with select and flex merge]
Cell BE Processor
• Distributed Memory
• Heterogeneous
  – 8 SIMD (SPU) cores
  – 1 PowerPC (PPU)
• Element Interconnect Bus
  – 4 rings
  – 205 Gb/s
• Gedae Programming Language
• Communication Layer
Gedae
• Commercial data-flow language and programming GUI
• Performance analysis tools
CFAR Benchmark
[Diagram: filter graph with the stages uInt to float, square, right window, left window, add, align data, find targets]
[Figure: Profile of CFAR filters on Cell; bar chart of Average Execution Time (microseconds) per 100 tokens, axis 0 to 5, with bars for right window, uIntToFloat, find targets, add]
Data Dependency
• By changing threshold, change % targets
  – 1.3 %
  – 7.3 %
• Additional workload per target
  – 16 µs
  – 32 µs
  – 64 µs

Additional Workload    1.3 % Targets    7.3 % Targets
16 µs                  0.82             1.45
32 µs                  1.06             1.39
64 µs                  1.27             1.47
More Benchmarks
Benchmark      Field               Results: configuration             Results: Speedup
Dedup          Information Theory  Rabin block/max chunk size
                                     4096/512                         2.00
JPEG           Image Processing    Image width x height
                                     128x128                          1.31
                                     256x256                          1.16
                                     512x512                          1.25
Value-at-Risk  Finance             stocks/walks/timesteps
                                     16/1024/1024                     0.98
                                     64/1024/1024                     1.56
                                     128/1024/1024                    1.55
Conclusions
• Flexible filters
  – adapt to data dependent bottlenecks
  – distributed load balancing
  – provide speedup without modification to original filters
  – can be implemented on top of general stream languages