Search Space Properties for Pipelined FPGA Applications
Heidi Ziegler, Mary Hall, Byoungro So
University of Southern California, Information Sciences Institute
Oct 2, 2003
2
Mapping Assignment
Machine Vision Kernel (application requirements)
1. Edge detection
2. Feature extraction
3. Distance computation
[Figure: the kernel is mapped ("Map") onto an FPGA]
3
Mapping an Application to Hardware
Machine Vision Kernel (MVIS)
1. Edge detection
2. Feature extraction
3. Distance computation
[Figure labels: configurable logic, configurable logic element, datapath, on-chip storage, interconnect, off-chip memory]
Compute Data Layout
Partition Chip Capacity
Manage Communication
4
Build on Prior Work in DEFACTO
Automatic design space exploration for individual loop nests (DAC03, PLDI02)
Analyses and transformations to exploit ILP (PLDI02) and maximize memory bandwidth (LPCP02)
Communication and pipeline analysis to exploit data and task parallelism (FCCM02, DAC03)
[Figure: design flow. C -> Analyses and Transformations -> SUIF to VHDL -> Behavioral Synthesis and Estimation -> Good Design? (No: iterate; Yes: Logic Synthesis and Place & Route)]
5
This Research
Integrates communication and pipelining analysis with the single loop design space exploration
Defines and illustrates search space properties for the global optimization problem
Describes a search algorithm and presents a case study
6
Sequential MVIS Kernel
[Figure: execution order over time of the Edge, Feature, and Distance loop nests, shown as Pipeline Stage S1, S2, and S3. Legend: read, write, execution order. Each loop nest accesses 2-D arrays (labeled A, B, F, D) in row-wise order; read-after-write (RAW) data dependences on B and F link consecutive stages.]
7
Reaching Definition Data Access Descriptor
$\mathrm{RDAD}_{\{r,w\},s}(A)$: set of tuples that describes basic data access information
  s: program point
  r, w: read or write array access
Each tuple contains:
  accessed array section, as integer linear inequalities
  traversal order, as a vector of dimensions from slowest to fastest
  vector of dominant induction variables, one for each dimension
  set of statements this tuple describes (def or use)
  set of reaching definitions
8
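Communication Requirements
Array B: Write (3) in Stage S1, Read (4) in Stage S2, giving a RAW dependence
$\mathrm{RDAD}_{w,s_1}(B) = \langle\, 0 \le d_1 \le 29 \wedge 0 \le d_2 \le 29,\ [1,2],\ (x,y),\ \{3\},\ \emptyset \,\rangle$
$\mathrm{RDAD}_{r,s_2}(B) = \langle\, 0 \le d_1 \le 29 \wedge 0 \le d_2 \le 29,\ [1,2],\ (x,y),\ \{4\},\ \{3\} \,\rangle$
Communication of B: $f(\mathrm{RDAD}_{w,s_i}(B),\ \mathrm{RDAD}_{r,s_j}(B))$
Solve directly for data, granularity, placement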
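The slide states that the data to communicate is a function of the producer's write descriptor and the consumer's read descriptor. A minimal sketch of one such function, assuming both array sections are simple rectangular boxes (the slides allow general integer linear inequalities), is a per-dimension intersection. The `Section` type and both function names are hypothetical and not part of DEFACTO.

```c
#define NDIMS 2  /* assumption: 2-D arrays, as in the MVIS kernel */

/* Hypothetical rectangular array section: lo[d] <= index_d <= hi[d]. */
typedef struct {
    int lo[NDIMS];
    int hi[NDIMS];
} Section;

/* Intersect the section written by the producer with the section read by
 * the consumer. Returns 1 and fills *out when the intersection is
 * non-empty; returns 0 otherwise (the contents of *out are then unspecified). */
int section_intersect(const Section *written, const Section *read, Section *out)
{
    for (int d = 0; d < NDIMS; d++) {
        int lo = written->lo[d] > read->lo[d] ? written->lo[d] : read->lo[d];
        int hi = written->hi[d] < read->hi[d] ? written->hi[d] : read->hi[d];
        if (lo > hi)
            return 0;  /* no element is both written and read */
        out->lo[d] = lo;
        out->hi[d] = hi;
    }
    return 1;
}

/* Number of array elements in a non-empty section. */
long section_size(const Section *s)
{
    long n = 1;
    for (int d = 0; d < NDIMS; d++)
        n *= (long)(s->hi[d] - s->lo[d] + 1);
    return n;
}
```

With a 30 x 30 written section and an identical read section, the intersection is the whole section, so every element of the array has to be communicated between the two stages.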
9
Task Graph
Nodes are pipeline stages
Communication edge descriptors (CEDs) computed from RDADs
  array section, per communication instance
  send point
  receive point
$\mathrm{CED}_{s_i \to s_j}(A)$
[Figure: task graph with stages S1 to S5, each annotated with its set of RDADs. Edges: CED s1->s2(a), CED s1->s5(a), CED s1->s4(x), CED s2->s3(a), CED s2->s3(b), CED s4->s5(y), CED s5->s3(y). Each edge is labeled with the producer rate of its source stage and the consumer rate of its destination stage.]
10
Global Optimization Strategy
Two criteria
  The design's execution time should be minimized
  The design's space utilization, for a given level of performance, should be minimized
Estimates
  Behavioral synthesis area (all loops)
  Behavioral synthesis timing (all loops)
  Communication rates
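The two criteria can be read as a lexicographic ordering on candidate designs: faster wins, and among equally fast designs the smaller one wins. The sketch below encodes that reading. The `DesignEstimate` record and `design_better` are hypothetical names, and treating "a given level of performance" as exactly equal cycle counts is a simplifying assumption.

```c
/* Hypothetical estimate record for one candidate design. */
typedef struct {
    long cycles;  /* estimated execution time */
    long area;    /* estimated space utilization */
} DesignEstimate;

/* Returns 1 if design a is preferable to design b: lower estimated
 * execution time first; for equal time, lower estimated area. */
int design_better(const DesignEstimate *a, const DesignEstimate *b)
{
    if (a->cycles != b->cycles)
        return a->cycles < b->cycles;
    return a->area < b->area;
}
```

In practice the inputs would come from the behavioral synthesis estimates listed on the slide rather than from fixed numbers.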
11
Transformations
Local
  Unroll and jam
  Scalar replacement
  Custom data layout
Global
  Communication granularity and placement
  Producer-consumer rate matching
  Data reorganization on-chip
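As an illustration of the first two local transformations, the sketch below applies unroll and jam with factor 2 to a small loop nest and then reuses a loaded value through a scalar. The two-point stencil, the array names, and the size `N` are hypothetical stand-ins, since the slides elide the full kernel.

```c
#define N 8  /* hypothetical image dimension; must be even for the factor-2 version */

/* Original loop nest: a hypothetical 2-point stencil. */
void stencil(int in[N + 1][N], int out[N][N])
{
    for (int x = 0; x < N; x++)
        for (int y = 0; y < N; y++)
            out[x][y] = in[x][y] + in[x + 1][y];
}

/* Unroll the outer loop by 2 and jam the two copies of the inner loop.
 * in[x + 1][y] is now needed by both statements, so it is loaded once
 * into a scalar (scalar replacement) instead of twice from memory. */
void stencil_unroll_jam2(int in[N + 1][N], int out[N][N])
{
    for (int x = 0; x < N; x += 2)
        for (int y = 0; y < N; y++) {
            int shared = in[x + 1][y];
            out[x][y]     = in[x][y] + shared;
            out[x + 1][y] = shared + in[x + 2][y];
        }
}
```

Both functions compute the same result; the second performs three loads per pair of outputs instead of four, which is the kind of memory-access reduction these transformations aim for.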
12
High-Level Design Flow
[Figure: design flow from a C program to a configuration bit stream. Boxes: Basic Compiler Optimizations; Communication and Pipeline Analysis; Scalar Replacement; Unroll and Jam; Custom Data Layout; Producer-Consumer Rate Matching; Communication Granularity Analysis; SUIF to VHDL; Behavioral Synthesis and Estimation; Good Design? (Yes / No); Logic Synthesis / Place & Route; Configuration Bit Stream]
13
Observation 1: Non-increasing Memory Accesses
Choose to place communication on-chip
[Figure: single-loop solution vs. global solution for stages S1 and S2, showing arrays A, B, D, and E in off-chip memory and on the configurable logic device]
14
Observation 2: Non-increasing Unroll Factor
Local solution assumed to be best-case performance, worst-case space estimate
Reduce unroll factors
[Figure: single-loop solution vs. global solution for stages S1 and S2]
15
Observation 3: Matching Rates without Affecting Performance
Avoid creating longer critical paths
If $\mathit{rate}_{prod}(d) < \mathit{rate}_{cons}(d)$, we can safely reduce the unroll factor for S3 until the rates match
[Figure: stages S1, S2, and S3 connected by CED(a) and CED(d), each labeled with its producer rate and consumer rate]
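The rate-matching rule can be sketched as a small loop over the consumer's unroll factor. The sketch assumes a simple model that the slides do not state: the consumer's rate grows linearly with its unroll factor, and unroll factors are reduced by halving. The function and parameter names are hypothetical.

```c
/* Assumed model: a consumer with unroll factor u consumes
 * cons_base_rate * u elements per unit time.
 *
 * Halve the consumer's unroll factor as long as the halved consumer
 * would still keep up with the producer. The consumer never becomes
 * slower than the producer, so no longer critical path is created. */
int match_consumer_unroll(long rate_prod, long cons_base_rate, int unroll)
{
    while (unroll > 1 && cons_base_rate * (unroll / 2) >= rate_prod)
        unroll /= 2;
    return unroll;
}
```

When the consumer is already no faster than the producer, the unroll factor is returned unchanged, which matches the condition on the slide.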
16
Optimization Algorithm: Step 1
Apply Pipeline and Communication Analysis
[Figure: task graph S1 -> S2 -> S3 with CED_{s1,s2}(peak) and CED_{s2,s3}(feature_x). S1: RDAD_{r,s1}(u), RDAD_{w,s1}(peak). S2: RDAD_{r,s2}(peak), RDAD_{w,s2}(feature_x), RDAD_{w,s2}(feature_y). S3: RDAD_{r,s3}(feature_x), RDAD_{r,s3}(u), RDAD_{r,s3}(v), RDAD_{w,s3}(ssd)]
/* Stage S1: reads u, writes peak */
for (x = 0; x < image-2; x++) {
  for (y = 0; y < image-2; y++) {
    uh1 = -3*u[x][y] - 3*u[x+1][y] /* ... */;
    uh2 = -3*u[x][y] + 3*u[x+1][y] /* ... */;
    peak[x][y] = uh1 + uh2;
  }
}
/* Stage S2: reads peak, writes feature_x */
for (x = 0; x < image-2; x++) {
  for (y = 0; y < image-2; y++) {
    if (peak[x][y] > threshold)
      feature_x[x][y] = x;
    else feature_x[x][y] = 0;
  }
}
/* Stage S3: reads feature_x, u, and v; writes ssd */
for (x = 0; x < image-2; x++) {
  for (y = 0; y < image-2; y++) {
    if (feature_x[x][y] != 0)
      ssd[x][y] = (u[x][y]-v[x][y+1]) * (u[x][y]-v[x][y+1]) /* ... */;
  }
}
17
Optimization Algorithm: Step 2
Find Single Loop Solutions in Isolation
[Figure: Stage 1, Stage 2, and Stage 3, each with its own set of unroll factors; peak is communicated from Stage 1 to Stage 2 and feature_x from Stage 2 to Stage 3. The slide repeats the three loop nests shown in Step 1.]
18
Optimization Algorithm: Step 3
Match Producer and Consumer Rates
[Figure: task graph S1 -> S2 -> S3 with CED(peak) and CED(feature_x), each labeled with its producer rate and consumer rate]
$\mathit{rate}_{prod}(\mathit{peak}) = \mathit{rate}_{cons}(\mathit{peak})$
$\mathit{rate}_{prod}(\mathit{feature\_x}) = \mathit{rate}_{cons}(\mathit{feature\_x})$
19
Optimization Algorithm: Step 4
Apply Greedy Strategy to Meet Chip Constraint
[Figure: Stage 1, Stage 2, and Stage 3 sharing the chip]
Check: $\sum_{i=1}^{n} \mathit{area}_i \le \mathit{capacity}$
If not, apply greedy strategy and then repeat steps 3 and 4.
Final Solution
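The slides do not spell out which greedy choice is made, so the following is only one possible sketch of the capacity check and reduction loop. It assumes that a stage's area grows linearly with its unroll factor and that the greedy step halves the unroll factor of the stage currently occupying the most area. The `Stage` type and function names are hypothetical, and the rate re-matching of step 3 is left out.

```c
/* Hypothetical per-stage record. */
typedef struct {
    int  unroll;         /* current unroll factor */
    long area_per_copy;  /* assumed area of one unrolled copy of the loop body */
} Stage;

/* Sum of the estimated areas of all pipeline stages. */
long total_area(const Stage *s, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += s[i].area_per_copy * s[i].unroll;
    return sum;
}

/* While the design exceeds the chip capacity, halve the unroll factor of
 * the largest stage that can still be reduced. Unroll factors only ever
 * decrease. Returns 1 if the design fits, 0 if it does not fit even with
 * every reducible stage at unroll factor 1. */
int fit_to_capacity(Stage *s, int n, long capacity)
{
    while (total_area(s, n) > capacity) {
        int pick = -1;
        long largest = 0;
        for (int i = 0; i < n; i++) {
            long a = s[i].area_per_copy * s[i].unroll;
            if (s[i].unroll > 1 && a > largest) {
                largest = a;
                pick = i;
            }
        }
        if (pick < 0)
            return 0;  /* nothing left to reduce */
        s[pick].unroll /= 2;
        /* A full implementation would re-run rate matching (step 3) here. */
    }
    return 1;
}
```

Because unroll factors are only reduced, the search stays within the space bounded by the single-loop solutions, in line with the non-increasing unroll factor observation.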
20
Related Work
Synthesizing high-level constructs
  Handel-C, RaPiD, PipeRench, Babb et al.
Design space exploration
  Derrien/Rajopadhye, Cameron, PICO
Program analysis on arrays
  Hall et al., Amarasinghe, Balasundaram & Kennedy
Pipeline analysis
  Splash 2, Weinhardt & Luk, Du et al., Goldstein et al.
21
Conclusion
System-level compiler automatically derives a pipelined implementation with explicit communication, while partitioning the chip capacity among pipeline stages
Global optimization strategy
  Built upon local solution with communication
Constrain the search space
  Non-increasing memory accesses
  Non-increasing unroll factors
22
Contact Information
Project Web Site
www.isi.edu/asd/defacto
Authors’ email addresses
ziegler, mhall, [email protected]