fpga architecture support for heterogeneous, relocatable...
1
24th International Conference on Field Programmable Logic and Applications, September 3rd, 2014
C. Huriaux, O. Sentieys and R. Tessier
FPGA Architecture Support for Heterogeneous, Relocatable Partial Bitstreams
Christophe HURIAUX (1), Olivier SENTIEYS (1, 2), Russell TESSIER (3)
(1) University of Rennes 1, France
(2) Inria, France
(3) University of Massachusetts, USA
2
Outline
§ Introduction
  § Overview of the FlexTiles project
  § Architecture Overview
  § Advantages of 3-D Stacking
§ Principles
  § Task Migration in an FPGA
  § Task Migration in FlexTiles
  § Heterogeneous case
§ Approach
  § Coping with Heterogeneity
  § Design Constraints
§ Results
  § Implementation in VPR
§ Conclusion
3
FP7 FlexTiles Project
§ FlexTiles: Self-adaptive heterogeneous manycore based on Flexible Tiles
§ Provide a heterogeneous manycore architecture offering
  § Large flexibility
  § High performance, energy efficiency
  § Improved programming efficiency
  § Self-adaptation through virtualization
4
Architecture Overview
§ 3D-stacked heterogeneous manycore
  § General Purpose Processors (GPP)
    § for flexibility and programming homogeneity
  § Network on Chip
  § Dedicated hardware accelerators mapped at run-time on a reconfigurable layer
§ Reconfigurable layer with seamless task migration capabilities
§ Virtualization layer to provide an abstraction of the manycore and self-adaptive services
§ Tool-chain for parallelization and compilation
5
Architecture Overview
[Figure: reconfigurable layer, with labels: 3D interface to the NoC, DSP blocks, memory blocks]
6
Task migration
§ Classical problem in dynamic reconfiguration [1]
§ Enhance resource usage
[Figure: task placement example, labelled "4x4?"]
[1] K. Compton, Z. Li, J. Cooley, S. Knol, and S. Hauck, “Configuration relocation and defragmentation for run-time reconfigurable computing,” IEEE Transactions on VLSI Systems, vol. 10, no. 3, pp. 209 –220, 2002.
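The resource-usage argument can be illustrated with a small sketch. It assumes a toy model that is not in the slides: the fabric is a 6x6 grid of identical cells, and `mark` and `find_free_region` are hypothetical helpers. A single 2x2 task in the middle of the grid blocks every 4x4 placement even though 32 of the 36 cells are free; migrating that task to a corner frees a 4x4 window.

```python
from typing import List, Optional, Tuple

Fabric = List[List[bool]]  # fabric[y][x] is True when the cell is in use


def mark(fabric: Fabric, x: int, y: int, w: int, h: int, used: bool) -> None:
    """Mark a w-by-h rectangle with top-left corner (x, y) as used or free."""
    for dy in range(h):
        for dx in range(w):
            fabric[y + dy][x + dx] = used


def find_free_region(fabric: Fabric, w: int, h: int) -> Optional[Tuple[int, int]]:
    """Return the first (x, y) where a w-by-h task fits, scanning row by row."""
    rows = len(fabric)
    cols = len(fabric[0]) if rows else 0
    for y in range(rows - h + 1):
        for x in range(cols - w + 1):
            if not any(fabric[y + dy][x + dx] for dy in range(h) for dx in range(w)):
                return (x, y)
    return None


fabric = [[False] * 6 for _ in range(6)]
mark(fabric, 2, 2, 2, 2, True)            # a 2x2 task sits in the middle
before = find_free_region(fabric, 4, 4)   # every 4x4 window overlaps it

mark(fabric, 2, 2, 2, 2, False)           # migrate the task ...
mark(fabric, 0, 0, 2, 2, True)            # ... to the top-left corner
after = find_free_region(fabric, 4, 4)    # a 4x4 window is now available
```

Here `before` is `None` and `after` is `(2, 0)`. A real fabric adds constraints that this toy grid ignores, such as block types and routing resources.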
7
3D Stacking
§ 3D-Stacked Reconfigurable Accelerators
  § Improved resource usage
  § Improved bandwidth/latency
  § Improved performance and energy efficiency
[Figure: a reconfigurable layer stacked on a multicore layer made of cores]
8
Task Migration in an FPGA
§ Predefined reconfigurable regions
§ Bit-stream depends on task location
[Figure: FPGA surrounded by an I/O ring; HW Accelerator #1 uses bitstream BS #1 at one location and a different bitstream, BS #2, at another]
9
Task Migration in FlexTiles
§ A task is synthesized, placed & routed into a Virtual Bit-Stream (VBS)
  § Independent of the task's physical location in the fabric
  § No predefined configuration domains
§ Easier resource sharing and distribution, simplified task migration
[Figure: tasks 1, 2 and 3 placed at different locations in the fabric]
§ Reconfiguration controller generates final BS at run-time
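A minimal sketch of the position-independence idea, under assumptions that are not from the slides: the VBS is modelled as a list of hypothetical `VbsEntry` records holding offsets relative to the task origin plus an opaque configuration word, and `generate_final_bs` stands in for the reconfiguration controller. The real VBS also abstracts routing, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class VbsEntry:
    dx: int      # column offset relative to the task origin
    dy: int      # row offset relative to the task origin
    config: int  # opaque configuration word for that logic element


def generate_final_bs(
    vbs: List[VbsEntry],
    origin: Tuple[int, int],
    fabric_size: Tuple[int, int],
) -> Dict[Tuple[int, int], int]:
    """Translate location-independent entries into absolute fabric coordinates."""
    ox, oy = origin
    width, height = fabric_size
    final: Dict[Tuple[int, int], int] = {}
    for entry in vbs:
        x, y = ox + entry.dx, oy + entry.dy
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"entry at ({x}, {y}) falls outside the fabric")
        final[(x, y)] = entry.config
    return final


vbs = [VbsEntry(0, 0, 0b1010), VbsEntry(1, 0, 0b0110), VbsEntry(0, 1, 0b0001)]
at_a = generate_final_bs(vbs, origin=(2, 3), fabric_size=(8, 8))
at_b = generate_final_bs(vbs, origin=(5, 0), fabric_size=(8, 8))
```

The same `vbs` yields two final bitstreams that differ only in their coordinates, which is what allows one stored description to serve every location.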
10
Task Migration in FlexTiles
[Figure: reconfigurable fabric with RAM and DSP columns and 3D network interfaces (3D NI); HW Accelerator #1 (VBS #1) and HW Accelerator #2 (VBS #2) are placed in the fabric]
11
Heterogeneity
§ Homogeneous case
  § No constraint on task placement
  § Regular routing architecture
§ Cope with heterogeneity
  § RAM, DSP, 3D I/Os
  § Migration is limited
    § vertically to the same column
    § to the next column containing the same complex blocks
[Figure legend: Task, Configured LE, Logic Element (LE)]
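The migration limit can be made concrete with a small sketch. It assumes a simplified column-based fabric in which every column has a single block type; the column layout and the `legal_column_offsets` helper are illustrative and not taken from the slides. A task can only move horizontally to an offset where the column types under its footprint match.

```python
from typing import List


def legal_column_offsets(fabric_columns: List[str], task_columns: List[str]) -> List[int]:
    """Return every x offset where the task's column types match the fabric's."""
    span = len(task_columns)
    return [
        x
        for x in range(len(fabric_columns) - span + 1)
        if fabric_columns[x:x + span] == task_columns
    ]


# Hypothetical fabric: logic columns interleaved with RAM and DSP columns.
fabric_columns = ["LE", "LE", "RAM", "LE", "LE", "DSP",
                  "LE", "LE", "RAM", "LE", "LE", "DSP"]

logic_only = legal_column_offsets(fabric_columns, ["LE", "LE"])
with_ram = legal_column_offsets(fabric_columns, ["LE", "RAM", "LE"])
```

In this layout the logic-only task has four legal offsets (`[0, 3, 6, 9]`), while the task that spans a RAM column has only two (`[1, 7]`), one per RAM column.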
12
Proposed architecture
§ Heterogeneous blocks routing is abstracted from logic routing
§ Long lines allow a trade-off between placement flexibility and routing complexity
§ A two-level routing is performed at runtime:
  § Logic routing (as in the homogeneous case)
  § Heterogeneous block routing through long lines
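As a rough illustration of the second routing level, the sketch below picks, for a piece of logic at a given column, the nearest block of the required type on the same row and reports how many columns a long line would have to span. The row layout, the `nearest_block` helper and the tie-breaking rule are assumptions made for this example; the slides do not describe the selection policy.

```python
from typing import List, Optional, Tuple


def nearest_block(row_columns: List[str], logic_x: int,
                  block_type: str) -> Optional[Tuple[int, int]]:
    """Return (column, span) of the closest block of the given type, or None.

    Ties are broken toward the lower column index (an arbitrary choice).
    """
    candidates = [x for x, kind in enumerate(row_columns) if kind == block_type]
    if not candidates:
        return None
    best = min(candidates, key=lambda x: (abs(x - logic_x), x))
    return best, abs(best - logic_x)


# Hypothetical row: logic columns interleaved with RAM and DSP columns.
row = ["LE", "LE", "RAM", "LE", "LE", "DSP",
       "LE", "LE", "RAM", "LE", "LE", "DSP"]

dsp_choice = nearest_block(row, logic_x=4, block_type="DSP")
ram_choice = nearest_block(row, logic_x=4, block_type="RAM")
```

From column 4 the nearest DSP is at column 5 (span 1) and the nearest RAM at column 2 (span 2). Longer spans give the placer more freedom but consume more long-line resources, which is the trade-off named above.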
13
Design Constraints
§ I/Os are made through 3D Network Interfaces, spread over the reconfigurable fabric
[Figure: floorplan of the reconfigurable fabric, showing the reconfiguration controller (Reconfiguration CTRL) and reconfiguration RAM, columns of DSP and MEM blocks, and 3D NI blocks each paired with an AI block]
14
Implementation in VPR
§ Versatile Place and Route (VPR), open source CAD tool for placement and routing
§ Part of the Verilog To Routing (VTR) framework
§ Source code modified to implement our techniques and deal with our constraints
  § Horizontal long lines spread over partitions
  § Separate homogeneous and heterogeneous routing
VPR and VTR: https://code.google.com/p/vtr-verilog-to-routing/
15
Implementation in VPR
VPR Original Routing Model
§ Logic grid
  § Block placement
    § X: simple block
    § Y: 2 blocks tall
§ Mesh routing lines
  § Switch boxes
  § Interconnect
[Figure: grid of X blocks and one Y block, with connection flexibilities Fc=0.5 and Fc=1]
16
Implementation in VPR
Enhanced Routing Model
§ Logic grid
  § Block placement
  § Block typing
    § X: homogeneous
    § Y: heterogeneous
§ Mesh routing lines
  § Long lines
  § Switch boxes
  § Interconnect
    § Homogeneous
    § Heterogeneous
[Figure: grid of X blocks and one Y block in the enhanced model]
17
Results
§ Architecture based on a simplified Stratix IV with:
  § Dual-port 144k memories
  § Fracturable 36x36 multipliers
§ Evaluation on two criteria
  § Delay of the critical path
  § Minimum channel width
    § Number of tracks in the homogeneous routing channels
§ Minimum channel width determined by VPR
  § Not directly related to silicon area
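When no channel width is fixed, VPR searches for the smallest width at which the circuit still routes. The sketch below shows the idea as a plain binary search. The `routable` predicate is a hypothetical stand-in for a full routing attempt, the search assumes that routability is monotonic in the width, and the thresholds are toy values chosen only to exercise the code, not measurements from the slides.

```python
from typing import Callable


def min_channel_width(routable: Callable[[int], bool], upper: int = 512) -> int:
    """Smallest width in [1, upper] for which routable(width) is True.

    Assumes that if a width routes, every larger width routes too.
    """
    if not routable(upper):
        raise ValueError("circuit does not route even at the upper bound")
    lo, hi = 1, upper
    while lo < hi:
        mid = (lo + hi) // 2
        if routable(mid):
            hi = mid          # mid works, try narrower channels
        else:
            lo = mid + 1      # mid fails, need wider channels
    return lo


# Toy predicates standing in for routing attempts on two architectures.
w_classic = min_channel_width(lambda w: w >= 40)
w_enhanced = min_channel_width(lambda w: w >= 60)
ratio = w_enhanced / w_classic
```

The ratio of the two minimum widths is the kind of metric reported on the channel width slide. Real routing attempts are expensive and not perfectly monotonic, so the tool's actual search includes details that this sketch omits.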
18
Results
§ Benchmark set: VTR framework circuits [1]

Circuit             # Mem   # Mult    # LB
bgm                     0       11   2,174
boundtop                1        0   2,977
ch_intrinsics           1        0     272
diffeq1                 0        5      41
diffeq2                 0        5      43
LU8PEEng               45        8      30
mkDelayWorker32B       41        0     497
mkPktMerge             15        0      17
mkSMAdapter4B           5        0     181
or1200                  2        1     273
raygentop               1        7     192
stereovision1           0       38     990

[1] J. Rose, J. Luu, C. W. Yu, et al., “The VTR project: architecture and CAD for FPGAs from Verilog to routing,” in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ACM, 2012, pp. 77-86.
19
Results: Delay
§ Estimation of the worst-case delay
  § Impossible to predict where connections to long lines will be made
  § Some channels crossing fixed-function blocks are longer
20
Results: Delay
§ Only 2% delay increase (on average)
[Chart: critical path delay in ns per benchmark for the classic and enhanced architectures, with the proposed/classic ratio on a second axis]
21
Results: Min. Channel Width
§ 1.8X channel width increase on average
§ Need for specific routing algorithms to deal with the heterogeneous interconnection network
[Chart: minimum channel width W in number of tracks per benchmark for the classic and enhanced architectures, with the proposed/classic ratio on a second axis]
22
Conclusion
§ FPGA embedded in a 3D architecture
§ More flexibility for task placement and/or relocation
§ Low impact on delay, but a cost in routing resources
§ Need to find a trade-off between flexibility and the area increase caused by the additional connections
23
Thank you for your attention
More info on FlexTiles: http://www.flextiles.eu
25
Virtual Bit-Stream: Example
§ Hiding routing details
  § Full BS is 129 bits
  § Could be reduced by giving fewer details
Jan. 2014, CAIRN project-team
[Figure: CLB with inputs CLBIN[0] to CLBIN[3] and output CLBOUT, surrounded by routing terminals numbered 0 to 20]
26
Virtual Bit-Stream: Example
§ Hiding routing details
  § List of I/O and connections
    § 20 → 8
    § 1 → 9
    § 5 → 18
[Figure: the same block with routing terminals numbered 0 to 20 and only the listed connections shown]
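To give a feel for why a connection list can be smaller than the 129-bit full bitstream, the sketch below counts the bits of one possible encoding: each connection is stored as a pair of fixed-width terminal identifiers. This encoding and the `connection_list_bits` helper are assumptions made for illustration; the slides do not state how the VBS is encoded or how large it is for this example.

```python
from typing import List, Tuple


def connection_list_bits(connections: List[Tuple[int, int]], n_terminals: int) -> int:
    """Bits needed to store each connection as two fixed-width terminal IDs."""
    bits_per_id = max(1, (n_terminals - 1).bit_length())
    return len(connections) * 2 * bits_per_id


# Connections listed on the slide; terminals are numbered 0 to 20.
connections = [(20, 8), (1, 9), (5, 18)]
vbs_bits = connection_list_bits(connections, n_terminals=21)
```

With 21 terminals each identifier takes 5 bits, so under this hypothetical encoding the three connections take 30 bits. The price is that the reconfiguration controller has to work out the detailed routing at run-time.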
27
Results: BS Sizes on MCNC Benchmarks
[Chart: bitstream size in kilobits (axis from 0 to 1600), split into routing and logic, for the benchmarks tseng, tseng, diffeq, diffeq, apex4, des, ex5p and misex3]
28
Results: VBS Sizes on MCNC Benchmarks
[Chart: BS size and VBS size in kilobits (axis from 0 to 1600) and compression ratio (axis from 0% to 100%) for the benchmarks tseng, tseng, diffeq, diffeq, apex4, des, ex5p and misex3]
Compression ratio data labels, in extraction order: 44.4%, 49.2%, 47.2%, 55.2%, 49.7%, 29.5%, 27.4%, 26.6%
29
Introduction: Architecture Overview
[Figure label: 3D access point to the NoC]
30
Introduction: Architecture Overview
[Figure: general architecture overview]