séminaire cosi ’01
Séminaire COSI ’01. Power Driven Processor Array Partitioning for FPGA SoC. S. Derrien, S. Rajopadhye. Content: context and motivations, silicon compilation tools, target architectures, power consumption, related work, partitioning, modeling power, experimental results, conclusion.
Séminaire COSI-Roscoff’01 1
Séminaire COSI ’01
Power Driven Processor Array Partitioning for FPGA SoC
S. Derrien, S. Rajopadhye
Content
Context and motivations: silicon compilation tools, target architectures, power consumption, related work
Partitioning
Modeling power
Experimental results
Conclusion
Silicon compilation tools
Parallel processor array architectures: regular and scalable (well suited to FPGAs); specialized high-performance datapath
Restricted class of loops: SUREs (systems of uniform recurrence equations, i.e. uniform dependencies); static polyhedral loop domain
Compute-intensive nested loops: image processing (motion estimation, stereo vision); signal processing (QR factorization, DLMS)
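As an illustration of a loop nest with uniform dependencies, the sketch below writes a matrix product as a system of uniform recurrences: every value depends only on a neighbor at a constant offset in the (i, j, k) iteration space. The function name `matmul_sure` and the assignment of the three dependence vectors to the axes are assumptions made for this example, not taken from the talk.

```python
def matmul_sure(a, b):
    """Square matrix product written as uniform recurrences over (i, j, k)."""
    n = len(a)
    A, B, C = {}, {}, {}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # A is propagated along j: dependence vector (0, 1, 0)
                A[i, j, k] = a[i][k] if j == 0 else A[i, j - 1, k]
                # B is propagated along i: dependence vector (1, 0, 0)
                B[i, j, k] = b[k][j] if i == 0 else B[i - 1, j, k]
                # C is accumulated along k: dependence vector (0, 0, 1)
                prev = 0 if k == 0 else C[i, j, k - 1]
                C[i, j, k] = prev + A[i, j, k] * B[i, j, k]
    # The result sits on the last k-plane of the iteration domain
    return [[C[i, j, n - 1] for j in range(n)] for i in range(n)]


print(matmul_sure([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The iteration domain here is a cube (a static polyhedron), and each point performs one multiply-add, which is the kind of cell a processor array would implement.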
Power consumption
General model and motivations: $P = P_{stat} + V_{dd}^2 \cdot C_d \cdot D \cdot f$ (gate-level model); estimate at RTL level (entropy-based models)
Mainly dictated by: on-chip area cost and activity; off-chip I/O volume
System-level power model? Estimate from the specification and the target architecture
Target architecture
[Figure: block diagram with FPGA, CPU, system memory and external world]
Embedded CPU: PowerPC, Nios
SoC bus: AMBA, CoreConnect; plug-and-play IP cores
Shared memory: low latency, high bandwidth
Related work
Compiler transformations to reduce memory accesses [Kandemir]: loop fusion, loop tiling, loop reordering
Design space exploration for custom memory systems [Imec]: systematic exploration; multi-level memory hierarchy; the approach is brute force
Content
Context and motivations
Target architectures
Partitioning: clustering (LSGP), tiling (LPGS), co-partitioning
Modeling power
Experimental results
Conclusion
Tiling (LPGS)
Partition the PE array into tiles: tiles are executed sequentially; intermediate results are stored in off-chip memory; requires unidirectional communications
Tile shape is rectangular: bounds parallel to the PE-space basis vectors; perfect "tiling" of the processor space
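A minimal sketch of the tiling step, assuming a 2-D virtual PE array and a tile shape that divides it exactly. The name `lpgs_tiles` and the row-by-row tile order are choices made for this example. Each returned tile would be mapped onto the physical array and run to completion before the next one starts.

```python
def lpgs_tiles(array_shape, tile_shape):
    """Split a 2-D virtual PE array into rectangular tiles, in execution order."""
    rows, cols = array_shape
    t_rows, t_cols = tile_shape
    if rows % t_rows or cols % t_cols:
        raise ValueError("tile shape must divide the array shape (perfect tiling)")
    tiles = []
    for r0 in range(0, rows, t_rows):
        for c0 in range(0, cols, t_cols):
            # All PEs of one tile run in parallel; tiles run one after another
            tiles.append([(r, c)
                          for r in range(r0, r0 + t_rows)
                          for c in range(c0, c0 + t_cols)])
    return tiles


tiles = lpgs_tiles((4, 6), (2, 3))
print(len(tiles), len(tiles[0]))  # prints "4 6"
```

With a 4x6 virtual array and 2x3 tiles, only 6 physical PEs are needed, at the cost of 4 sequential passes and of the data exchanged between passes through external memory.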
Tiling (LPGS)
[Figure: PE array partitioned into tiles, with Mux, DeMux and FIFO blocks around the active tile; tile parameters labeled "=2" and "=3" (symbols lost); the matrix is diagonal and its determinant equals Npe; H is the domain height]
Clustering (LSGP)
Regroups PEs into clusters: operations are executed sequentially; I/O accesses are reduced
Cluster shape is rectangular: bounds parallel to the PE-space basis vectors; perfect "tiling" of the processor space
Scheduling is axes-major: several schedules are possible; sequence of clusterings along each axis; simplifies control logic
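The clustering step can be sketched as a mapping from each virtual PE to a physical PE plus a local time step. The code below assumes a 2-D array and one particular axes-major order (first axis outermost). The name `lsgp_map` and that ordering are assumptions for illustration; the slide notes that several schedules are possible.

```python
def lsgp_map(pe, cluster_shape):
    """Map a virtual PE (i, j) to (physical PE, local step inside its cluster)."""
    i, j = pe
    cx, cy = cluster_shape
    physical = (i // cx, j // cy)            # which cluster, i.e. which physical PE
    local_step = (i % cx) * cy + (j % cy)    # axes-major order inside the cluster
    return physical, local_step


print(lsgp_map((3, 4), (2, 3)))  # ((1, 1), 4)
```

Each physical PE then emulates `cx * cy` virtual PEs one after another, which is why local memory grows with the cluster size while the number of PEs shrinks.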
Clustering (LSGP)
[Figure: PE array grouped into clusters, with labels "y=3" and "y=2"; the matrix is diagonal and its determinant equals Npe; legend: PE index vector, iteration index vector, original space-time mapping, domain height H]
Clustering (LSGP)
[Figure: PE datapath (multiply-add on A, B, C) drawn three times: original PE, x=2, and x=2, y=3]
Resource usage estimate:
Hybrid partitioning
Step 1: the array is tiled, to tune the I/O volume
Step 2: each tile is clustered, to tune the resource usage
Trade-off: off-chip I/O volume versus local memory sizes
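The two steps can be combined in a small bookkeeping sketch. It assumes rectangular shapes that divide each other exactly, and the name `co_partition` and the returned triple are hypothetical. It only counts tile passes, physical PEs and sequential steps per PE; it does not model the I/O volume or the memory sizes themselves.

```python
def co_partition(array_shape, tile_shape, cluster_shape):
    """Return (tile passes, physical PEs, sequential steps per physical PE)."""
    n_tiles = n_physical = steps_per_pe = 1
    for a, t, c in zip(array_shape, tile_shape, cluster_shape):
        if a % t or t % c:
            raise ValueError("tile must divide the array and cluster must divide the tile")
        n_tiles *= a // t       # step 1: tiles executed one after another
        n_physical *= t // c    # step 2: clusters of one tile run in parallel
        steps_per_pe *= c       # virtual PEs emulated by each physical PE
    return n_tiles, n_physical, steps_per_pe


print(co_partition((8, 12), (4, 6), (2, 3)))  # (4, 4, 6)
```

The product of the three numbers always equals the size of the virtual array (here 4 x 4 x 6 = 96 = 8 x 12), so the partitioning only changes where and when each virtual PE runs.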
Content
Context and motivations
Target architectures
Partitioning
Modeling power: IO power model, core power model, putting it all together
Experimental results
Conclusion
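Dynamic IO energy model
IO energy depends on: IO volume (RAM clock speed); operation (read, write); port toggle rate
$E_{io} = K_{rd} \cdot V_{rd} + K_{wr} \cdot V_{wr}$, where $K_{rd}$ and $K_{wr}$ are technological constants and $V_{rd}$ and $V_{wr}$ are the numbers of read and write I/O operations
Determine the IO volume: for all loop variables; given the tiling parameters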
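The IO energy formula is a weighted sum of the read and write volumes, which can be written directly as a function. The function name and the numeric constants in the usage line are placeholders chosen for the example, not measured values.

```python
def io_energy(v_rd, v_wr, k_rd, k_wr):
    """Dynamic IO energy: E_io = K_rd * V_rd + K_wr * V_wr."""
    return k_rd * v_rd + k_wr * v_wr


# Placeholder constants: 2.0 energy units per read, 3.0 per write
print(io_energy(v_rd=1000, v_wr=500, k_rd=2.0, k_wr=3.0))  # 3500.0
```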
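IO volume estimate (1/2)
The tile IO volume is called the "footprint"
Estimate for this footprint [Arg95]: uses the spread vector of the dependencies
$V = \sum_{i=1}^{n} \det A_{i \leftarrow a}$, where $A_{i \leftarrow a}$ is the matrix $A$ with its $i$th row substituted with the spread vector $a$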
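A minimal sketch of the row-substitution estimate: sum, over the rows i, the determinant of the matrix with row i replaced by the spread vector. It assumes that the matrix is the tile matrix with non-negative entries, and the matrix and spread vector in the usage line are made-up values. The helper names `det` and `footprint` are also hypothetical.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for c, v in enumerate(m[0]):
        minor = [row[:c] + row[c + 1:] for row in m[1:]]
        total += (-1) ** c * v * det(minor)
    return total


def footprint(tile_matrix, spread):
    """Sum over i of det(tile_matrix with its i-th row replaced by spread)."""
    total = 0
    for i in range(len(tile_matrix)):
        substituted = [list(spread) if r == i else list(row)
                       for r, row in enumerate(tile_matrix)]
        total += det(substituted)
    return total


tile = [[4, 0, 0], [0, 5, 0], [0, 0, 6]]   # made-up diagonal tile matrix
print(footprint(tile, [1, 1, 0]))          # 54
```

For a diagonal tile matrix, each term is the spread component along axis i times the area of the tile face orthogonal to that axis (here 1 x 30 + 1 x 24 + 0 x 20), so the estimate counts the data crossing the tile boundary.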
IO volume estimate (2/2)
Total tile IO volume: [formula for $V_{io}$ garbled; it sums over the variables $k = 0..v$ and over the indices $i, j = 1..n$]
Legend: byte width of the kth variable; number of variables; tile size parameter; spread vector
Example:
dA=[1 0 0] aA=[1 0 0] lA=2 VA= 2.H.1
dB=[0 1 0] aB=[1 0 0] lB=2 VB= 2.H.
dC=[0 0 1] aC=[1 0 0] lC=4 VC=
[Figure: dependencies of A, B and C along the axes i, j, k, showing tile input data dependencies and tile output data dependencies]
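Core power model (1/4)
FPGA power dissipation model: $P_{core} = P_{stat} + K_c \cdot D_{lc} \cdot n_{lc} \cdot f$, with $K_c$ a technology constant, $D_{lc}$ the average toggle rate, $n_{lc}$ the number of logic cells and $f$ the design operating frequency
Not suited to our target FPGA architecture: a distinction is needed between LCs used as memory and LCs used as logic
$P_{core} = P_{stat} + K_c \cdot D_{lc} \cdot n_{lc} \cdot f + K_m \cdot D_m \cdot n_m \cdot f$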
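The extended core power model adds one dynamic term per kind of logic cell. The sketch below evaluates it as written; the function name and the numbers in the usage line are arbitrary placeholders, with units left to the caller.

```python
def core_power(p_stat, k_c, d_lc, n_lc, k_m, d_m, n_m, f):
    """P_core = P_stat + K_c*D_lc*n_lc*f + K_m*D_m*n_m*f."""
    logic_term = k_c * d_lc * n_lc * f    # cells used as logic
    memory_term = k_m * d_m * n_m * f     # cells used as distributed memory
    return p_stat + logic_term + memory_term


# Placeholder values, chosen only to make the arithmetic easy to follow
print(core_power(p_stat=10, k_c=2, d_lc=0.5, n_lc=100, k_m=3, d_m=0.25, n_m=40, f=1))  # 140.0
```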
Core power model (2/4)
Control logic is not modeled: too complex to estimate; no significant contribution to power
Core power depends on: the number of PEs (depends on … and …); the area usage for each PE (depends on …); the average toggle rate for the PE datapath and local memory (application constant)
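Core power model (3/4)
Memory resource usage: LCs used as distributed memory (16x1 bits); datapath is a design constant (library based)
Area cost for a PE array: $A_d = n_p \cdot A_f$, with $n_p$ the number of PEs and $A_f$ the datapath functional cost
[Formulas garbled: $n_p$ is a ratio of two determinants; the memory area $A_m$ is a sum over $k = 0..p-1$ involving the factor 16, the clustering parameter along processor-space axis $j$ and the register width along processor-space axis $k$]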
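Core power model (4/4)
Energy cost for the whole loop nest: we have $E_c = P_c \cdot n_{cycle} \cdot T_{cycle}$; we will consider $n_{cycle} = V_{calc} / n_p$, with $V_{calc}$ the total loop computation volume
Total core energy cost: [formula for $E_{core}$ garbled; it combines a datapath term in $K_f$, $D_f$, $V_{calc}$, $A_f$ with a memory term in $K_m$, $D_m$, $V_{calc}$ and a sum over $k = 0..p-1$, where $D$ denotes an average toggle rate]
Energy is not dependent on $n_p$!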
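The independence from $n_p$ can be checked with a short derivation from the two relations on this slide. It is a sketch under stated assumptions: only one dynamic term of the form $K \cdot D \cdot n \cdot f$ is kept, the cell count is taken as $n = n_p \cdot A$ with $A$ the area of one PE, and $T_{cycle} = 1/f$. The symbols $P_{dyn}$, $E_{dyn}$ and $A$ are introduced here for the example.

```latex
\begin{align*}
P_{dyn} &= K \cdot D \cdot n \cdot f, \qquad n = n_p \cdot A, \qquad T_{cycle} = \frac{1}{f} \\
E_{dyn} &= P_{dyn} \cdot n_{cycle} \cdot T_{cycle}
         = K \cdot D \cdot n_p \cdot A \cdot f \cdot \frac{V_{calc}}{n_p} \cdot \frac{1}{f}
         = K \cdot D \cdot A \cdot V_{calc}
\end{align*}
```

Both $n_p$ and $f$ cancel: doubling the number of PEs doubles the dynamic power but halves the execution time. Under the same assumptions the static term would contribute $P_{stat} \cdot V_{calc} / (n_p \cdot f)$, so the cancellation applies to the dynamic part only.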
Content
Context and motivations
Target architectures
Partitioning
Modeling power
Experimental results: model validation, extrapolations
Conclusion
IO power model results
[Figure: four surface plots over x and y (each from 5 to 25): observed IO power dissipation (mW), predicted IO power dissipation (mW), absolute error (mW), relative error (%)]
Core power model results
[Figure: four surface plots over x and y (each from 5 to 25): predicted core power (mW), observed core power (mW), absolute error (mW), relative error (%)]
System power model
[Figure: four surface plots over x and y (each from 5 to 25): predicted total energy dissipation (loop execution energy cost, J), observed total energy dissipation (J), energy dissipation absolute error (J), energy dissipation relative error (%)]
Content
Context and motivations
Target architectures
Partitioning
Modeling power
Experimental results
Conclusion: solving the optimisation problem (Lagrange multipliers); custom cache for embedded CPUs; extension to SAREs (affine dependencies)
Conclusion
Models match experiments: cheap measurement setup; many components contribute to current dissipation (LEDs, PCI, etc.)
Observations: the trade-off evolves with technology; more sensitive for ASICs?
Future work (1/2)
Formulation of the optimization problem: minimize energy per iteration; constraints on performance and area
Analytical solution? Lagrange multipliers; no closed form for n>3, but fast numerical methods exist
Future work (2/2)
Model for embedded CPUs: trade-off between cache size and memory accesses; determine the optimal cache size and the associated tiling parameters
Extension to SAREs? Affine dependencies; more general loops