DAPR: Design Automation for Partially Reconfigurable FPGAs
Shaon Yousuf, Ph.D. Student
NSF CHREC Center, University of Florida
Dr. Ann Gordon-Ross, Associate Professor of ECE
NSF CHREC Center, University of Florida
Dynamic Reconfiguration
• Dynamic reconfiguration can be beneficial to system designers
  • Allows run-time hardware adaptation
  • Enables time-multiplexing of FPGA resources
• Two types of dynamic reconfiguration
  • Full reconfiguration (FR)
  • Partial reconfiguration (PR)
• PR isolates reconfiguration to a portion of the FPGA fabric
• Hardware time-multiplexing and flexibility: designs are loaded on demand onto the same fabric
• Power savings: Designs A, B, and C are stored in external memory
[Figure: full reconfiguration example. A configuration controller loads the full bitstream for Design A, B, or C from external memory into the FPGA fabric as each design is required; execution is stalled while the fabric is reconfigured.]
• PR divides the FPGA into two regions
  • Static design maps to the static region
  • Partially reconfigurable modules map to partially reconfigurable regions (PRRs)
  • Regions communicate using partition pins (Parpins)
• PR enhances the benefits of FR
  • Partial bitstreams are significantly smaller than full bitstreams
    • Reduced memory requirements
    • Reduced power requirements
    • Reduced reconfiguration time
  • Increased flexibility
    • Design functionality is changed by simply loading a new partial bitstream
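To make the link between bitstream size and reconfiguration time concrete, here is a minimal sketch. It assumes an idealized configuration port that accepts 32 bits per cycle at 100 MHz, and the two bitstream sizes are hypothetical values chosen only for illustration; none of these numbers come from the slides.

```python
def reconfiguration_time_ms(bitstream_bytes, port_width_bits=32, port_clock_hz=100e6):
    """Ideal transfer time of a bitstream through a configuration port.

    The port width and clock are assumptions for illustration; real systems
    also pay for memory access and controller overhead.
    """
    bytes_per_second = (port_width_bits / 8) * port_clock_hz
    return bitstream_bytes / bytes_per_second * 1e3


# Hypothetical sizes, not measured values.
full_bitstream_bytes = 4_000_000
partial_bitstream_bytes = 200_000

full_ms = reconfiguration_time_ms(full_bitstream_bytes)
partial_ms = reconfiguration_time_ms(partial_bitstream_bytes)
print(f"full: {full_ms:.2f} ms, partial: {partial_ms:.2f} ms")
```

Under these assumptions the transfer time scales linearly with bitstream size, which is why smaller partial bitstreams translate directly into shorter reconfiguration times.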
Partial Reconfiguration (PR)
[Figure: example PR system with 2 PRRs on the FPGA fabric. The static region holds the static design (central controlling agent, ICAP, and memory controller) and is configured by the full bitstream. PRR 1 and PRR 2 hold the PR modules, which are delivered as partial bitstreams for Modules A & B and for Modules C & D.]
• PR system design is constantly evolving
  • Currently supported only by Xilinx
  • Altera support announced
• The PR system design flow is complex
  • PR system performance may suffer if the system design is not carefully considered
  • Partitioning the design into a static region and PRRs in the hardware description language (HDL)
  • Setting PRR and Parpin placement constraints (floorplanning)
  • Analyzing timing results for the floorplanned constraints to ensure acceptable design performance
• PR benefits outweigh the complexities
  • Benefits applications such as software-defined radios and image processing applications
PR Challenges and Motivations
[Figure: Xilinx PR implementation flow. Steps shown: HDL Design Description, HDL Synthesis, Set Design Constraints, Timing/Placement Analysis, Implement Static Design and PR Modules, Merge, and Final Generated Bitstreams. The figure marks the steps that are mandatory for PR.]
Contribution
• Currently there is insufficient PR system design support
  • HDL design partitioning is not straightforward
    • Requires finding a balance between required performance and flexibility
  • PRR placement constraints must be set and evaluated manually
    • Large design space, as no formal process exists for determining optimal PRR and partition pin (Parpin) placement constraints during floorplanning
• We present the DAPR design flow
  • The Design Automation for Partial Reconfiguration design flow
  • Automatically generates design floorplans (candidate designs)
    • Eliminates manual design space exploration
  • Outputs the highest clock frequency design found
  • Also outputs a Pareto optimal set of PR designs
    • Design points trade off clock frequency and partial bitstream size
• The DAPR design flow can significantly reduce PR design time
  • Makes PR design more accessible and amenable to system designers
DAPR Design Flow
[Figure: the DAPR design flow shown next to the PR flow, with manual steps distinguished from automated steps. Both flows start from an HDL design description and end with the final generated bitstreams. In the DAPR design flow, the system designer annotates the HDL (modified HDL design description) and the DAPR tool covers HDL Synthesis, Set Design Constraints, Timing/Placement Analysis, Implement Base Design, Implement PR Modules, and Merge.]
DAPR design flow performs automated design space exploration
Outputs best found clock frequency design
Outputs a Pareto optimal set of design points that trade off clock frequency and partial bitstream size
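A Pareto optimal set over clock frequency and partial bitstream size can be sketched as follows. This is only an illustration of the concept, not the DAPR tool's code: the function name `pareto_front`, the dictionary keys, and all candidate values are hypothetical.

```python
def pareto_front(designs):
    """Keep designs that no other design beats on both objectives.

    A design is dominated when another design has a clock frequency at least
    as high and a partial bitstream at least as small, and is strictly better
    on at least one of the two.
    """
    front = []
    for d in designs:
        dominated = any(
            o["clock_mhz"] >= d["clock_mhz"]
            and o["bitstream_kb"] <= d["bitstream_kb"]
            and (o["clock_mhz"] > d["clock_mhz"] or o["bitstream_kb"] < d["bitstream_kb"])
            for o in designs
        )
        if not dominated:
            front.append(d)
    return sorted(front, key=lambda d: d["clock_mhz"], reverse=True)


# Hypothetical candidate designs (values invented for illustration).
candidates = [
    {"name": "c1", "clock_mhz": 210.0, "bitstream_kb": 96},
    {"name": "c2", "clock_mhz": 198.0, "bitstream_kb": 64},
    {"name": "c3", "clock_mhz": 190.0, "bitstream_kb": 80},  # dominated by c2
    {"name": "c4", "clock_mhz": 175.0, "bitstream_kb": 48},
    {"name": "c5", "clock_mhz": 205.0, "bitstream_kb": 96},  # dominated by c1
]

front = pareto_front(candidates)
best_clock = max(candidates, key=lambda d: d["clock_mhz"])
print([d["name"] for d in front], best_clock["name"])
```

The highest clock frequency design is always a member of the front, so reporting it separately is a convenience for designers who only care about speed.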
DAPR Tool Phases
Phase 1: Information identification, extraction, and collection
Phase 2: Candidate generation
Phase 3: Bitstream generation
Phase 4: Design evaluation
DAPR Design Flow Phases
[Figure: DAPR tool flowchart.]
• Initial input: the VHDL top file and the modified VHDL top file; the DAPR tool starts here
• Phase 1 (information identification, extraction, and collection): identify the static region, PRRs, and design file names; associated file: PR automation information file (.prai)
• Phase 2 (candidate generation): synthesize modules and estimate resource requirements; perform automated floorplanning; associated files: device information library file (.dil) and user constraints file (.ucf)
• Phase 3 (bitstream generation): implement and merge the design
• Phase 4 (design evaluation): check whether the design constraints are met and whether iterations are left (> 0); associated file: design constraints file (.dcs)
• Output: the best found PR design’s full and partial bitstreams
DAPR tool file descriptions
• PR automation file (.prai) contains port map connection information
• Device information library file (.dil) contains target device hardware resource information
• Design constraints file (.dcf) contains device-specific design constraints
[Figure: DAPR tool generated directory structure, showing the candidate designs, the design constraints file, the PR automation file, and the Virtex-4 LX25 device information library file.]
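The iteration across Phases 2 to 4 can be sketched as a simple loop. The slides do not spell out where each decision branch leads, so this sketch assumes that a candidate failing implementation or the design constraints is discarded, that the loop repeats while iterations are left, and that the highest clock frequency design seen is returned. All function names and values are hypothetical stand-ins, not the DAPR tool's interfaces.

```python
def run_flow(max_iterations, generate_candidate, implement, meets_constraints):
    """Iterate candidate generation, implementation, and evaluation."""
    best = None
    iterations_left = max_iterations
    while iterations_left > 0:
        iterations_left -= 1
        floorplan = generate_candidate()       # Phase 2: candidate generation
        result = implement(floorplan)          # Phase 3: None models a failed implementation
        if result is None:
            continue
        if not meets_constraints(result):      # Phase 4: design evaluation
            continue
        if best is None or result["clock_mhz"] > best["clock_mhz"]:
            best = result
    return best


# Toy stand-ins so the loop can run without any FPGA tools.
floorplans = iter([(0, 3), (4, 9), (7, 2), (12, 15), (10, 1), (6, 18)])


def toy_implement(floorplan):
    x, y = floorplan
    if x % 5 == 0:                             # pretend these placements fail
        return None
    return {"floorplan": floorplan, "clock_mhz": 150 + x + y}


best = run_flow(
    max_iterations=6,
    generate_candidate=lambda: next(floorplans),
    implement=toy_implement,
    meets_constraints=lambda r: r["clock_mhz"] >= 160,
)
print(best)
```

Passing the phase steps in as callables keeps the loop independent of any particular vendor tool invocation.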
Candidate Floorplan Generation
[Figure: candidate floorplan generation flowchart. After initialization, while Icurr < Irand the tool performs randomized PRR and Parpin placement, evaluates each solution, and stores the solution if it is accepted. When Icurr == Irand, the tool loads the stored solution and calculates the initial temperature. It then repeatedly creates a new solution, evaluates it, updates the stored solution if the new solution is accepted, and updates the temperature. The flow stops when Icurr is no longer less than Imax.]
Icurr = current running iteration number; Imax = maximum iteration bound; Irand = random number
• PRR placements and Parpin placements leverage a simulated annealing (SA) algorithm
  • The SA algorithm trims design space exploration time by initially using random placements
  • The SA algorithm leverages the random placement solution values and works towards optimal placements
• SA algorithm for placement
  • Place PRRs randomly across the fabric to find a good initial placement
    • Cost function: PRR aspect ratio and communication distance between PRRs
  • Evaluate design
    • Use Xilinx utilities to evaluate partial bitstream size and clock frequency
  • Calculate initial temperature
    • Use SA equation
  • Vary PRR initial placement size
    • Vary distance between PRRs and PRR height
  • Change PRR placement location to explore better solutions
    • Depends on a certain number of uphill moves, downhill moves, or total number of moves
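The placement idea above can be sketched as a small simulated annealing loop. This is a toy model under stated assumptions, not the DAPR implementation: the fabric size, the cost weights, the cooling rate, and the initial temperature heuristic (spread of the random-phase costs) are all invented for illustration, the cost is computed analytically rather than with Xilinx utilities, and PRR overlap is not checked.

```python
import math
import random

GRID_W, GRID_H = 40, 30  # hypothetical fabric size in CLBs


def clamp(v, lo, hi):
    return max(lo, min(v, hi))


def cost(prrs):
    """Toy cost: PRR aspect ratio plus distance between PRR centres."""
    aspect = sum(abs(math.log(w / h)) for (_x, _y, w, h) in prrs)
    centres = [(x + w / 2, y + h / 2) for (x, y, w, h) in prrs]
    distance = sum(abs(ax - bx) + abs(ay - by)
                   for i, (ax, ay) in enumerate(centres)
                   for (bx, by) in centres[i + 1:])
    return aspect + 0.1 * distance


def random_placement(rng, areas):
    prrs = []
    for area in areas:
        h = rng.randint(2, 10)
        w = math.ceil(area / h)
        prrs.append((rng.randint(0, GRID_W - w), rng.randint(0, GRID_H - h), w, h))
    return prrs


def neighbour(rng, prrs, areas):
    """Either move one PRR or change its height (width follows the area)."""
    prrs = list(prrs)
    i = rng.randrange(len(prrs))
    x, y, w, h = prrs[i]
    if rng.random() < 0.5:
        x, y = x + rng.randint(-3, 3), y + rng.randint(-3, 3)
    else:
        h = clamp(h + rng.choice((-1, 1)), 2, 10)
        w = math.ceil(areas[i] / h)
    prrs[i] = (clamp(x, 0, GRID_W - w), clamp(y, 0, GRID_H - h), w, h)
    return prrs


def anneal(areas, i_rand=20, i_max=500, seed=1):
    rng = random.Random(seed)
    # Random phase: sample placements and keep the cheapest as the start.
    samples = [random_placement(rng, areas) for _ in range(i_rand)]
    costs = [cost(s) for s in samples]
    current = samples[costs.index(min(costs))]
    current_cost = min(costs)
    best, best_cost = current, current_cost
    temperature = max(max(costs) - min(costs), 1e-6)  # assumed heuristic
    # Annealing phase: accept uphill moves with a temperature-dependent probability.
    for _ in range(i_rand, i_max):
        candidate = neighbour(rng, current, areas)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temperature *= 0.99
    return best, best_cost


areas = [30, 42, 24]  # hypothetical CLB requirements per PRR
best, best_cost = anneal(areas)
print(best, round(best_cost, 3))
```

Each tuple is (x, y, width, height) in CLBs. A real flow would replace `cost` with results from implementing the candidate, which is far more expensive and is the reason the number of explored candidates matters.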
Experimental Setup
• Software (Linux environment)
  • Perl 5.12.1
  • Dot language interpreter
  • Xilinx ISE 13.4
    • Synthesis options: Optimization Goal - Speed; Optimization Effort - Normal
• Hardware
  • Intel® Core™ 2 Duo E6750 2.66 GHz CPU and 3.24 GB of RAM
  • Xilinx Virtex-5 XUPV5-LX110T FPGA
Results – DAPR Simulated Annealing Placement
• PRR counter design used for evaluation
• Optimal clock frequency found using an exhaustive search (ES) algorithm
• Results show the percentage of the design space that the simulated annealing (SA) algorithm explored to find the optimal clock frequency
• SA also compared with a random exploration (RE) algorithm
[Figure: design space exploration results. Series: PRR size 4x6 CLBs (SA), 5x6 CLBs (SA), 6x6 CLBs (SA), 7x6 CLBs (SA), and 7x6 CLBs (RE). CLB = configurable logic block.]
• DAPR SA algorithm improves with increased PRR size
• DAPR SA algorithm outperforms RE for large PRR sizes
Results – 1K-point FFT
Solution improves with successful iterations*
*Successful iterations complete without place and route errors
Results – Pareto Optimal Set
Only 3% of the explored design space is interesting*
*Design points trade off clock frequency and partial bitstream size
Results – Additional PR Designs
The growth rate quickly levels off to within 2.3% of the highest achievable clock frequency, within an average of 10 successful iterations
Evolution of DAPR to DAPR++ Tool Suite
DAPR++ tool suite aids the design of RC systems using automation
• Creates master and slave FPGA component layout tree
• Creates FPGA VHDL black boxes for all components
• Automatically generates target device resource mapping
• Heuristically floorplans PRRs and partition pins
• Allows PRR partial bitstream manipulation in FPGA memory
• Creates network protocols for master and slave FPGAs
• Creates PR task reconfiguration schedules to reduce reconfiguration time
• Records data packet transfer rates between master and slave FPGAs
[Figure: DAPR++ system overview. A master FPGA with GPPs connects through a switch to slave FPGA 1 and slave FPGA 2, each containing PRRs.]
PR System Design Automation with DAPR++ Tool Suite
DAPR++ Architecture Generation
• Leverages a master-slave configuration
• Master FPGA used for centralized control
  • Contains single or multiple ARM-compatible 32-bit RISC processors (Amber processor) for application control
  • Loads/unloads hardware tasks into slave FPGAs
  • Transfers data to and receives data from slave FPGAs via a Wishbone-interface-compatible system bus
• Slave FPGAs used for hardware acceleration
  • Leverages PRML output for appropriate hardware architecture generation (number of PRRs, PRR interfaces, and PRR size)
  • PRRs loaded with required application task functionality
  • Leverages on-chip network for inter-PRR communication
• ARM-compatible 32-bit RISC processor: two available cores
  • Amber 23: 3-stage pipeline, a unified instruction and data cache, Wishbone interface, and 0.75 Dhrystone MIPS per MHz
  • Amber 25: 5-stage pipeline, separate data and instruction caches, Wishbone interface, and 1.0 Dhrystone MIPS per MHz
  • Both cores run the 2.4 Linux kernel
  • Amber 25 performs 30% to 40% better but is also 30% to 40% larger
[Figure: Amber 25 pipeline and Amber 23 pipeline.]
Amber core on Virtex-5 LX110T: initial synthesis results shown below

Core type and cache configuration      Slices   RAMB16   DSPs   Clock frequency
A23 32KB                               3236     2        2      233 MHz
A25 32KB                               3800     2        4      250 MHz
Wishbone interface                     256      0        0      350 MHz
Ethernet core                          1200     0        0      180 MHz
Overall system with one A25 core       8556     4        6      133 MHz
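As a worked example of the Dhrystone figures, the following sketch multiplies each core's Dhrystone MIPS per MHz by its synthesized clock frequency from the table. It assumes the cores actually run at those synthesis clock frequencies, which the slides do not state; the dictionary layout and helper name are illustrative.

```python
cores = {
    "Amber 23": {"dmips_per_mhz": 0.75, "clock_mhz": 233},
    "Amber 25": {"dmips_per_mhz": 1.0, "clock_mhz": 250},
}


def dmips(core):
    """Estimated Dhrystone MIPS at the synthesized clock frequency."""
    return core["dmips_per_mhz"] * core["clock_mhz"]


a23_dmips = dmips(cores["Amber 23"])
a25_dmips = dmips(cores["Amber 25"])

# Per-MHz advantage of the Amber 25 over the Amber 23.
per_mhz_gain = cores["Amber 25"]["dmips_per_mhz"] / cores["Amber 23"]["dmips_per_mhz"] - 1

print(f"A23: {a23_dmips:.2f} DMIPS, A25: {a25_dmips:.2f} DMIPS, "
      f"per-MHz gain: {per_mhz_gain:.0%}")
```

The per-MHz ratio works out to about one third, which sits inside the 30% to 40% performance range quoted for the Amber 25.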
Master FPGA System Overview
[Figure: master FPGA system overview. A Wishbone arbiter connects Amber 25 processor core 0, Amber 25 processor core 1, the Ethernet MAC (Ethernet interface), the boot loader (8k embedded SRAM containing the boot loader code), the primary interrupt controller (irq and firq), a statically configurable simple UART (UART interface), and a Wishbone to Xilinx Virtex-5 SRAM controller bridge that feeds the Xilinx Virtex-5 SRAM controller (SRAM interface).]
Wishbone Interface Extended Signal List

Name       Source   Description
ACK_I      Master   Indicates the normal termination of a bus cycle
ADR_O()    Master   Used to pass a binary address
CYC_O      Master   Indicates that a valid bus cycle is in progress
ERR_I      Master   Indicates an abnormal cycle termination
LOCK_O     Master   Indicates that the current bus cycle is uninterruptible
RTY_I      Master   Indicates that the interface is not ready to accept or send data
SEL_O()    Master   Indicates where valid data is expected
STB_O      Master   Indicates a valid data transfer cycle
TGA_O()    Master   Contains information associated with address lines
TGC_O()    Master   Contains information associated with bus cycles
WE_O       Master   Indicates whether the current local bus cycle is a READ or WRITE cycle
ACK_O      Slave    Indicates the termination of a normal bus cycle
ADR_I()    Slave    Used to pass a binary address
CYC_I      Slave    Indicates that a valid bus cycle is in progress
ERR_O      Slave    Indicates an abnormal cycle termination
LOCK_I     Slave    Indicates that the current bus cycle is uninterruptible
RTY_O      Slave    Indicates that the interface is not ready to accept or send data
SEL_I()    Slave    Indicates where valid data is placed on the data bus during a write
STB_I      Slave    Indicates that the SLAVE is selected
TGA_I()    Slave    Contains information associated with address lines
TGC_I()    Slave    Contains information associated with bus cycles
WE_I       Slave    Indicates whether the current local bus cycle is a READ or WRITE cycle

PRR Interface Signal List
(Signal type is relative to the data-flow controller.)

Signal name and size           Signal type   Function
p_consumerfsl_rdy (1 bit)      In            Indicates valid input data in consumer FSL
p_producerfsl_rdy (1 bit)      In            Indicates producer FSL is ready for data
rfd (1 bit)                    In            Indicates PRM is ready for data
done (1 bit)                   In            Indicates PRM will produce valid output data in the next clock cycle
dv (1 bit)                     In            Indicates PRM has produced valid output data
input_data (32 bit)            In            Input data signal to PRM
p_producerfsl_data (32 bit)    In            Input data signal from producer FSL
p_consumerfsl_en (1 bit)       Out           Allows reading data from consumer FSL
ce (1 bit)                     Out           Halts PRM in current state (overrides start signal)
start (1 bit)                  Out           Starts PRM when asserted
p_producerfsl_en (1 bit)       Out           Allows writing data to producer FSL
output_data (32 bit)           Out           Output data signal from PRM
p_consumerfsl_data (32 bit)    Out           Output data signal to consumer FSL
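The roles of CYC, STB, WE, ADR, and ACK in the Wishbone signal list can be illustrated with a small behavioural model of a single read or write cycle. This is a simplified software sketch, not RTL and not the Amber or DAPR++ code: the class and function names are hypothetical, the data signals (DAT) are added for the example, and the toy slave acknowledges every valid access immediately.

```python
class ToyWishboneSlave:
    """Word-addressed memory that acknowledges each valid access at once."""

    def __init__(self, words):
        self.mem = [0] * words

    def cycle(self, cyc_i, stb_i, we_i, adr_i, dat_i):
        """Evaluate one clock cycle and return (ack_o, dat_o)."""
        if not (cyc_i and stb_i):       # slave is not selected in this cycle
            return False, 0
        if we_i:                        # WRITE cycle
            self.mem[adr_i] = dat_i
            return True, 0
        return True, self.mem[adr_i]    # READ cycle


def master_single_access(slave, adr, we, data=0, max_wait=8):
    """Hold CYC_O and STB_O asserted until ACK_I is seen, then release them."""
    for _ in range(max_wait):
        ack_i, dat_i = slave.cycle(cyc_i=True, stb_i=True, we_i=we, adr_i=adr, dat_i=data)
        if ack_i:
            # Negating CYC_O and STB_O ends the bus cycle.
            slave.cycle(cyc_i=False, stb_i=False, we_i=False, adr_i=0, dat_i=0)
            return dat_i
    raise TimeoutError("slave did not acknowledge")


slave = ToyWishboneSlave(words=16)
master_single_access(slave, adr=3, we=True, data=0xCAFE)
readback = master_single_access(slave, adr=3, we=False)
untouched = master_single_access(slave, adr=0, we=False)
print(hex(readback), untouched)
```

A slower slave would simply return a false acknowledge for some cycles, and the master loop would keep the strobe asserted until the acknowledge arrives.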
Task B: PR System Design Automation with DAPR++ Tool Suite
PR-Task Manager (PRTM)
• Utilizes hardware reuse (reconfiguration overlapping) and configuration prefetching
• PRTM tested on a JPEG codec architecture
[Figure: simplified PR-task scheduler flowchart. Inputs: modularized C/C++ application, application task flow graph (TFG), and slave architecture PRR information. Steps: create application PR task schedule; map tasks to PRRs according to PRR size; check for task hardware reusability; create task pre-configuration schedule; execute application.]
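The effect of hardware reuse on the number of reconfigurations can be sketched with a toy counter. This is not the PRTM scheduler: it ignores PRR sizes, the task flow graph, and configuration prefetching, and it evicts PRRs in round-robin order purely for simplicity. The function name and task names are hypothetical.

```python
def count_reconfigurations(task_sequence, num_prrs, reuse=True):
    """Count PRR reconfigurations needed to run tasks in the given order.

    With reuse enabled, a task already configured in some PRR runs without
    reconfiguration. Eviction is round-robin, an assumption made only to
    keep the sketch short.
    """
    loaded = [None] * num_prrs          # task currently configured in each PRR
    reconfigurations = 0
    for task in task_sequence:
        if reuse and task in loaded:
            continue                    # hardware reuse: nothing to load
        victim = reconfigurations % num_prrs
        loaded[victim] = task
        reconfigurations += 1
    return reconfigurations


# Hypothetical schedule that runs the same three tasks twice.
schedule = ["fdct", "quantizer", "huffman", "fdct", "quantizer", "huffman"]

with_reuse = count_reconfigurations(schedule, num_prrs=3, reuse=True)
without_reuse = count_reconfigurations(schedule, num_prrs=3, reuse=False)
print(with_reuse, without_reuse)
```

In this toy case reuse halves the number of reconfigurations because every task stays resident; with fewer PRRs than distinct tasks the benefit depends heavily on the eviction policy.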
JPEG Codec Encoder Architecture
[Figure: encoder data path. PR modules loaded into the PR region: RGB2YCbCr & FDCT2D, ZigZag, Quantizer, Huffman Encoder, Run Length Encoder, Byte Stuffer, and Header Generator. Supporting blocks: encoder pipeline controller, MUX, DEMUX, buffer, RAM, and host interface (HOST PROG, HOST DATA), connected by data and control signals.]
JPEG Codec Decoder Architecture
[Figure: decoder data path. PR modules loaded into the PR region: Header Decoder, Byte Stripper, Huffman Decoder, Run Length Decoder, Dequantizer, Reorder, and YCbCr2RGB & IDCT2D. Supporting blocks: decoder pipeline controller, MUX, DEMUX, buffer, RAM, and host interface (HOST PROG, HOST DATA), connected by data and control signals.]
JPEG Codec
• Tests reveal that PRTM-based systems achieve an average 40% reconfiguration delay reduction
[Figure: reconfiguration delay results, annotated with a 20% reduction and a 60% reduction. PRR = partially reconfigurable region.]
Networking Tool Overview
• Sets up master and slave FPGA network interfaces
• Automatically generates hardware/software controllers
  • Master FPGA GPP to slave FPGA controller
  • Slave FPGA hardware controller for PRRs
[Figure: simplified master FPGA architecture (Amber processors, Wishbone interface, Ethernet) and simplified slave FPGA architecture (Ethernet, Wishbone interface, HW controller, PRRs).]
Results: Resource Requirements

Component                                  Slices    RAMB16   Max operating frequency
Slave FPGA PRRs                            1,600     2        250 MHz
100 Mbps Ethernet core                     1,100     4        180 MHz
Master FPGA processor with 8KB cache       4,820     10       250 MHz
Slave FPGA overall system with 2 PRRs      6,956     20       155 MHz
Master FPGA system with two GPPs           11,240    28       133 MHz
Results: Network Transfers
Networking Tool Experimental Setup
• Master/slave FPGA setup with 2 GPPs/PRRs
• Simple transmission test
  • Amber processors send data to PRRs
  • PRRs rotate the bit values and transfer back the result
• FFT, CORDIC, and matrix multiply cores
  • GPPs send data from master FPGA RAM to PRRs
  • PRRs process data and transfer back the result
• Tested cores achieved an average throughput of 41 Mbps
Conclusions and Future Work
• We presented the DAPR and DAPR++ design flows
• DAPR performs automatic design space exploration
  • Uses an iterative candidate PR floorplan generation methodology
• The DAPR design flow’s key contributions include:
  • Making PR design more accessible and amenable to a wide range of system designers
  • Creating high-performance systems with reduced design time effort
  • Allowing designers to choose between PR designs that trade off clock frequency and partial bitstream size
• DAPR++ tool suite allows automated RC system generation
  • Each tool generates different RC system portions tailored to application needs
  • Portions are integrated to build a complete RC system ready for use on selected FPGAs
• Future work
  • Enhance portability of the DAPR++ tool suite to multiple devices and vendors
QUESTIONS?
This work was supported in part by the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422. We also gratefully acknowledge tools provided by Xilinx.