
DAPR: Design Automation for Partially Reconfigurable FPGAs

Shaon Yousuf, Ph.D. Student

NSF CHREC Center, University of Florida

Dr. Ann Gordon-Ross, Associate Professor of ECE

NSF CHREC Center, University of Florida

Dynamic Reconfiguration

• Dynamic reconfiguration can be beneficial to system designers
  • Allows run-time hardware adaptation
  • Enables time-multiplexing of FPGA resources
• Two types of dynamic reconfiguration
  • Full reconfiguration (FR)
  • Partial reconfiguration (PR)
• PR isolates reconfiguration to a portion of the FPGA fabric
• Hardware time-multiplexing and flexibility
  • Designs loaded on demand on the same fabric
• Power savings
  • Designs A, B, and C stored in external memory

[Figure: full reconfiguration example. Labels: external memory holding full bitstreams for Designs A, B, and C; configuration controller; FPGA fabric; "Design A required", "Design B required", "Design C required"; "Design A executing", "Design B executing", "Design C executing"; "Execution stalled!"]

Partial Reconfiguration (PR)

• PR divides the FPGA into two regions
  • Static design maps to the static region
  • Partially reconfigurable modules map to partially reconfigurable regions (PRRs)
  • Regions communicate using partition pins (Parpins)
• FR benefits are enhanced by PR
  • Partial bitstreams are significantly smaller than full bitstreams
    • Reduced memory requirements
    • Reduced power requirements
    • Reduced reconfiguration time
  • Increased flexibility
    • Design functionality is changed by simply loading a new partial bitstream

[Figure: example with 2 PRRs. Labels: FPGA fabric; static region holding the static design (central controlling agent, ICAP, memory controller), delivered as the full bitstream; PRR 1 and PRR 2; PR modules A, B, C, and D, delivered as partial bitstreams (Modules A & B; Modules C & D).]
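
The link between bitstream size and reconfiguration time can be illustrated with a rough estimate. This is a minimal sketch that assumes reconfiguration time is dominated by streaming the bitstream through the configuration port. The bitstream sizes and the port throughput are hypothetical placeholders, not measurements from this work.

```python
def reconfiguration_time_ms(bitstream_bytes: int, port_bytes_per_s: float) -> float:
    """Estimate the time to stream a bitstream through a configuration port."""
    if port_bytes_per_s <= 0:
        raise ValueError("port throughput must be positive")
    return 1000.0 * bitstream_bytes / port_bytes_per_s


# Hypothetical values, for illustration only.
FULL_BITSTREAM_BYTES = 4_000_000     # assumed full-device bitstream size
PARTIAL_BITSTREAM_BYTES = 200_000    # assumed partial bitstream size for one PRR
PORT_BYTES_PER_S = 100_000_000       # assumed configuration port throughput

full_ms = reconfiguration_time_ms(FULL_BITSTREAM_BYTES, PORT_BYTES_PER_S)
partial_ms = reconfiguration_time_ms(PARTIAL_BITSTREAM_BYTES, PORT_BYTES_PER_S)
print(f"full: {full_ms:.1f} ms, partial: {partial_ms:.1f} ms")
```

Under this model the reconfiguration time scales linearly with bitstream size, which is why a smaller partial bitstream shortens the time the fabric spends reconfiguring.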

PR Challenges and Motivations

• PR system design is constantly evolving
  • Currently supported only by Xilinx
  • Altera support announced
• PR system design flow is complex
  • PR system performance may suffer if the system design is not carefully considered
    • Static region and PRR design partitioning in hardware description language (HDL)
    • Setting PRR and Parpin placement constraints (floorplanning)
    • Analyzing timing results for the floorplanned constraints to ensure acceptable design performance
• PR benefits outweigh complexities
  • Benefits applications such as
    • Software defined radios
    • Image processing applications

[Figure: Xilinx PR Implementation Flow. Box labels: HDL Design Description; HDL Synthesis; Set Design Constraints; Implement Static Design and PR Modules; Timing/Placement Analysis; Merge; Final Generated Bitstreams. A marker identifies the mandatory steps for PR.]

Contribution

• Currently there is insufficient PR system design support
  • HDL design partitioning is not straightforward
    • Requires finding a balance between required performance and flexibility
  • PRR placement constraints must be set and evaluated manually
    • Large design space, as no formal process exists for determining optimal PRR and partition pin (PP) placement constraints during floorplanning
• We present the DAPR design flow
  • The Design Automation for Partial Reconfiguration design flow
  • Automatically generates design floorplans (candidate designs)
    • Eliminates manual design space exploration
  • Outputs the highest clock frequency design found
  • Also outputs a Pareto optimal set of PR designs
    • Design points trade off clock frequency and partial bitstream size
• The DAPR design flow can significantly reduce PR design time
  • Makes PR design more accessible and amenable to system designers

DAPR Design Flow

• DAPR design flow performs automated design space exploration
  • Outputs the best found clock frequency design
  • Outputs a Pareto optimal set of design points that trade off clock frequency and partial bitstream size

[Figure: PR flow compared with the DAPR design flow. PR flow box labels: HDL Design Description; HDL Synthesis; Implement Base Design; Implement PR Modules; Timing/Placement Analysis; Merge; Final Generated Bitstreams. DAPR design flow box labels: System Designer Annotations; Modified HDL Design Description; DAPR Tool; HDL Synthesis; Implement Base Design; Implement PR Modules; Timing/Placement Analysis; Merge. A legend distinguishes manual steps from automated steps.]

DAPR Tool Phases

• Phase 1: Information identification, extraction, and collection
• Phase 2: Candidate generation
• Phase 3: Bitstream generation
• Phase 4: Design evaluation

DAPR Design Flow Phases

[Figure: DAPR tool flowchart. Initial input: VHDL top file and modified VHDL top file (the DAPR tool starts here).
Phase 1, Information Identification, Extraction, and Collection: identify static region, PRRs, and design file names. Associated labels: PR automation information file (.prai); DAPR tool generated directory structure.
Phase 2, Candidate Generation: synthesize modules and estimate resource requirements; perform automated floorplanning. Associated labels: device information library file (.dil); user constraints file (.ucf).
Phase 3, Bitstream Generation: implement and merge design.
Phase 4, Design Evaluation. Associated labels: design constraints file (.dcs); decisions "Design constraints met" (Y/N) and "Iterations left > 0" (Y/N); output best found PR design's full and partial bitstreams.]

DAPR tool file descriptions

• PR automation file (.prai) contains port map connection information
• Device information library file (.dil) contains target device hardware resource information
• Design constraints file (.dcf) contains device-specific design constraints

[Figure: example DAPR tool files. Labels: candidate designs; design constraints file; PR automation file; Virtex-4 LX25 device information library file.]
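
The iteration implied by the two decisions in the flowchart ("Design constraints met" and "Iterations left > 0") can be sketched as a loop. This is a minimal sketch, not the DAPR tool itself: `Candidate`, `generate_candidate`, and `run_flow` are hypothetical names, the single clock-frequency target stands in for whatever the design constraints file specifies, and a random number replaces the real implementation and timing analysis.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    floorplan_id: int
    clock_mhz: float


def generate_candidate(iteration: int, rng: random.Random) -> Candidate:
    """Phase 2 stand-in: pretend to floorplan and report a clock frequency."""
    return Candidate(floorplan_id=iteration, clock_mhz=rng.uniform(80.0, 160.0))


def run_flow(target_mhz: float, max_iterations: int, seed: int = 0) -> Optional[Candidate]:
    """Repeat Phases 2 to 4 until the constraint is met or iterations run out."""
    rng = random.Random(seed)
    best = None
    for iteration in range(max_iterations):
        candidate = generate_candidate(iteration, rng)      # Phase 2
        # Phase 3 (implement and merge) would invoke the vendor tools here.
        if best is None or candidate.clock_mhz > best.clock_mhz:  # Phase 4
            best = candidate
        if best.clock_mhz >= target_mhz:                    # constraints met
            break
    return best                                             # best design found


best = run_flow(target_mhz=150.0, max_iterations=20)
print(best)
```

Returning the best candidate seen so far mirrors the flowchart's fallback of outputting the best found PR design when the iteration budget is exhausted.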

Candidate Floorplan Generation

[Figure: candidate floorplan generation flowchart. Node labels: Start; Initialization; Randomized PRR and Parpin placement; Evaluate solution; Accept new solution?; Store solution; Load stored solution and calculate initial temperature; Create new solution; Evaluate solution; Accept new solution?; Update stored solution; Update temperature; Stop. Decisions: Icurr < Imax; Icurr < Irand; Icurr == Irand. Legend: Icurr = current running iteration number, Imax = maximum iteration bound, Irand = random number.]

• PRR placements and Parpin placements leverage a simulated annealing (SA) algorithm
  • SA algorithm trims design space exploration time by initially using random placements
  • SA algorithm leverages random placement solution values and works toward optimal placements
• SA algorithm for placement
  • Place PRRs randomly across the fabric to find a good initial placement
    • Cost function
      • PRR aspect ratio
      • Communication distance between PRRs
  • Evaluate design
    • Use Xilinx utilities to evaluate partial bitstream size and clock frequency
  • Calculate initial temperature
    • Use SA equation
  • Vary PRR initial placement size
    • Vary distance between PRRs and PRR height
  • Change PRR placement location to explore better solutions
    • Depends on a certain number of uphill moves, downhill moves, or total number of moves
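
The placement search described above can be illustrated with a generic simulated annealing loop. This is a minimal sketch under stated assumptions, not the DAPR implementation: the fabric size, the cost weights, the fixed initial temperature `t0`, and the geometric cooling factor `alpha` are all hypothetical, and a toy cost built from PRR aspect ratio and distance between PRRs replaces the evaluation with Xilinx utilities.

```python
import math
import random

FABRIC_W, FABRIC_H = 40, 60   # hypothetical fabric size in CLBs


def overlaps(a, b):
    """True if two (x, y, w, h) rectangles share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def cost(placement):
    """Toy cost: aspect-ratio penalty, distance between PRR centers, overlap penalty."""
    total = sum(abs(math.log(w / h)) for (_, _, w, h) in placement)
    centers = [(x + w / 2, y + h / 2) for (x, y, w, h) in placement]
    for i in range(len(placement)):
        for j in range(i + 1, len(placement)):
            dx = abs(centers[i][0] - centers[j][0])
            dy = abs(centers[i][1] - centers[j][1])
            total += 0.1 * (dx + dy)
            if overlaps(placement[i], placement[j]):
                total += 100.0
    return total


def random_placement(sizes, rng):
    """Place each PRR of size (w, h) at a random legal position."""
    return [(rng.randint(0, FABRIC_W - w), rng.randint(0, FABRIC_H - h), w, h)
            for (w, h) in sizes]


def neighbor(placement, rng):
    """Move one randomly chosen PRR by a few CLBs, clamped to the fabric."""
    new = list(placement)
    i = rng.randrange(len(new))
    x, y, w, h = new[i]
    x = min(max(x + rng.randint(-3, 3), 0), FABRIC_W - w)
    y = min(max(y + rng.randint(-3, 3), 0), FABRIC_H - h)
    new[i] = (x, y, w, h)
    return new


def anneal(sizes, iterations=2000, t0=10.0, alpha=0.995, seed=1):
    rng = random.Random(seed)
    current = random_placement(sizes, rng)
    current_cost = cost(current)
    best, best_cost = current, current_cost
    t = t0
    for _ in range(iterations):
        cand = neighbor(current, rng)
        delta = cost(cand) - current_cost
        # Always accept improvements; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = cand, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current, cost(current)
        t *= alpha   # geometric cooling
    return best, best_cost


best, best_cost = anneal([(5, 6), (7, 6)])
print(best, round(best_cost, 3))
```

The slides derive the initial temperature from stored random solutions; the fixed `t0` here is a simplification chosen to keep the sketch short.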

Experimental Setup

• Software (Linux environment)
  • Perl 5.12.1
  • Dot language interpreter
  • Xilinx ISE 13.4
    • Synthesize options
      • Optimization Goal - Speed
      • Optimization Effort - Normal
• Hardware
  • Intel® Core™ 2 Duo E6750 2.66 GHz CPU and 3.24 GB of RAM
  • Xilinx Virtex-5 XUPV5-LX110T FPGA

Results – DAPR Simulated Annealing Placement

• PRR counter design used for evaluation
  • Optimal clock frequency found using an exhaustive search (ES) algorithm
  • Percentage of the design space explored by the simulated annealing (SA) algorithm before finding the optimal clock frequency
  • SA also compared with a random exploration (RE) algorithm

[Figure: plots for PRR size 5x6 CLBs (SA), PRR size 7x6 CLBs (SA), and PRR size 7x6 CLBs (RE). CLBs = configurable logic blocks.]

• DAPR SA algorithm improves with increased PRR size
• DAPR SA algorithm outperforms RE for large PRR sizes


Results – 1K-point FFT

Solution improves with successful iterations*

*Successful iterations complete without place and route errors


Results – Pareto Optimal Set

Only 3% of the explored design space is interesting*

*Design points trade off clock frequency and partial bitstream size
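
How a Pareto optimal set is extracted from explored design points can be sketched as follows. This is a minimal illustration, assuming each design is a pair of clock frequency (higher is better) and partial bitstream size (smaller is better). The function name `pareto_front` and the sample values are hypothetical and are not results from this work.

```python
def pareto_front(designs):
    """Return the designs that no other design dominates.

    Each design is (clock_mhz, bitstream_kb). A design dominates another if it
    is at least as good in both objectives and strictly better in one.
    """
    front = []
    for i, (clk, size) in enumerate(designs):
        dominated = any(
            c >= clk and s <= size and (c > clk or s < size)
            for j, (c, s) in enumerate(designs)
            if j != i
        )
        if not dominated:
            front.append((clk, size))
    return sorted(front)


# Hypothetical explored design points: (clock in MHz, partial bitstream in KB).
designs = [(120.0, 90), (150.0, 140), (150.0, 120), (110.0, 95), (135.0, 100)]
print(pareto_front(designs))
# prints [(120.0, 90), (135.0, 100), (150.0, 120)]
```

Only the non-dominated points are worth presenting to the designer, which is why a small fraction of the explored space ends up in the set.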

Results – Additional PR Designs

Growth rate quickly levels off to within 2.3% of the highest achievable clock frequency, within an average of 10 successful iterations

Evolution of DAPR to DAPR++ Tool Suite

DAPR++ tool suite aids designing RC systems using automation

• Creates master and slave FPGA component layout tree
• Creates FPGA VHDL black boxes for all components
• Automatically generates target device resource mapping
• Heuristically floorplans PRRs and partition pins
• Allows PRR partial bitstream manipulation in FPGA memory
• Creates network protocols for master and slave FPGAs
• Creates PR task reconfiguration schedules to reduce reconfiguration time
• Records data packet transfer rates between master and slave FPGAs

[Figure: system overview. Labels: switch; master FPGA; GPP; slave FPGA 1; slave FPGA 2; PRRs.]

PR System Design Automation with DAPR++ Tool Suite

DAPR++ Architecture Generation Leverages Master-Slave Configuration

• Master FPGA used for centralized control
  • Contains single or multiple ARM-compatible 32-bit RISC processors (Amber processor) for application control
  • Loads/unloads hardware tasks into slave FPGAs
  • Transfers data to and receives data from slave FPGAs via a Wishbone-interface-compatible system bus
• Slave FPGAs used for hardware acceleration
  • Leverages PRML output for appropriate hardware architecture generation (number of PRRs, PRR interfaces, and PRR size)
  • PRRs loaded with required application task functionality
  • Leverages on-chip network for inter-PRR communication

• ARM-compatible 32-bit RISC processor, with two available cores
  • Amber 23
    • 3-stage pipeline, a unified instruction and data cache, Wishbone interface, and 0.75 Dhrystone MIPS per MHz
  • Amber 25
    • 5-stage pipeline, separate data and instruction caches, Wishbone interface, and 1.0 Dhrystone MIPS per MHz
• Both cores run the Linux 2.4 kernel
• Amber 25 performs 30% to 40% better than Amber 23 but is also 30% to 40% larger

[Figure: Amber 25 pipeline and Amber 23 pipeline.]

PR System Design Automation with DAPR++ Tool Suite

Amber core on Virtex-5 LX110T: initial synthesis results

Core type and cache configuration   Slices   RAMB16   DSPs   Clock frequency
A23 32KB                            3236     2        2      233 MHz
A25 32KB                            3800     2        4      250 MHz
Wishbone Interface                  256      0        0      350 MHz
Ethernet Core                       1200     0        0      180 MHz
Overall system with one A25 core    8556     4        6      133 MHz
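
The Dhrystone figures quoted for the two cores can be combined with the synthesized clock frequencies above to give rough peak throughput numbers. This is a back-of-the-envelope sketch that assumes the DMIPS-per-MHz ratings hold at these clock rates; the function and variable names are hypothetical.

```python
def peak_dmips(dmips_per_mhz: float, clock_mhz: float) -> float:
    """Peak Dhrystone MIPS for a core at a given clock frequency."""
    return dmips_per_mhz * clock_mhz


# Values taken from the slides: DMIPS per MHz and synthesized clock in MHz.
A23_DMIPS_PER_MHZ, A23_CLOCK_MHZ = 0.75, 233
A25_DMIPS_PER_MHZ, A25_CLOCK_MHZ = 1.0, 250
SYSTEM_CLOCK_MHZ = 133   # overall system with one A25 core

a23_standalone = peak_dmips(A23_DMIPS_PER_MHZ, A23_CLOCK_MHZ)
a25_standalone = peak_dmips(A25_DMIPS_PER_MHZ, A25_CLOCK_MHZ)
a25_in_system = peak_dmips(A25_DMIPS_PER_MHZ, SYSTEM_CLOCK_MHZ)
per_mhz_gain = A25_DMIPS_PER_MHZ / A23_DMIPS_PER_MHZ - 1

print(f"A23 standalone: {a23_standalone:.2f} DMIPS")
print(f"A25 standalone: {a25_standalone:.2f} DMIPS")
print(f"A25 at system clock: {a25_in_system:.2f} DMIPS")
print(f"A25 per-MHz gain over A23: {per_mhz_gain:.0%}")
```

The system-level figure is bounded by the 133 MHz overall clock rather than by the standalone core frequency.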


Master FPGA System overview

[Figure: master FPGA block diagram. Block labels: Wishbone arbiter; Amber 25 processor core 0; Ethernet MAC with Ethernet interface; boot loader (8k embedded SRAM, contains boot loader code); primary interrupt controller (irq, firq); Wishbone to Xilinx Virtex-5 SRAM controller bridge; Xilinx Virtex-5 SRAM controller with SRAM interface; statically configurable simple UART with UART interface.]
Wishbone Interface Extended Signal List

Name       Source   Description
ACK_I      Master   Indicates the normal termination of a bus cycle
ADR_O()    Master   Used to pass a binary address
CYC_O      Master   Indicates that a valid bus cycle is in progress
ERR_I      Master   Indicates an abnormal cycle termination
LOCK_O     Master   Indicates that the current bus cycle is uninterruptible
RTY_I      Master   Indicates that the interface is not ready to accept or send data
SEL_O()    Master   Indicates where valid data is expected
STB_O      Master   Indicates a valid data transfer cycle
TGA_O()    Master   Contains information associated with address lines
TGC_O()    Master   Contains information associated with bus cycles
WE_O       Master   Indicates whether the current local bus cycle is a READ or WRITE cycle
ACK_O      Slave    Indicates the termination of a normal bus cycle
ADR_I()    Slave    Used to pass a binary address
CYC_I      Slave    Indicates that a valid bus cycle is in progress
ERR_O      Slave    Indicates an abnormal cycle termination
LOCK_I     Slave    Indicates that the current bus cycle is uninterruptible
RTY_O      Slave    Indicates that the interface is not ready to accept or send data
SEL_I()    Slave    Indicates where valid data is placed on the data bus during a write
STB_I      Slave    Indicates that the SLAVE is selected
TGA_I()    Slave    Contains information associated with address lines
TGC_I()    Slave    Contains information associated with bus cycles
WE_I       Slave    Indicates whether the current local bus cycle is a READ or WRITE cycle

PRR Interface Signal List

Signal name and size           Signal type*   Function
p_consumerfsl_rdy (1 bit)      In             Indicates valid input data in consumer FSL
p_producerfsl_rdy (1 bit)      In             Indicates producer FSL is ready for data
rfd (1 bit)                    In             Indicates PRM is ready for data
done (1 bit)                   In             Indicates PRM will produce valid output data in the next clock cycle
dv (1 bit)                     In             Indicates PRM has produced valid output data
input_data (32 bit)            In             Input data signal to PRM
p_producerfsl_data (32 bit)    In             Input data signal from producer FSL
p_consumerfsl_en (1 bit)       Out            Allows reading data from consumer FSL
ce (1 bit)                     Out            Halts PRM in current state (overrides start signal)
start (1 bit)                  Out            Starts PRM when asserted
p_producerfsl_en (1 bit)       Out            Allows writing data to producer FSL
output_data (32 bit)           Out            Output data signal from PRM
p_consumerfsl_data (32 bit)    Out            Output data signal to consumer FSL

*Signal type is relative to the data-flow controller
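
The roles of the Wishbone handshake signals (CYC, STB, WE, ADR, ACK) can be illustrated with a toy behavioral model of a single read or write cycle. This is a minimal sketch, not the bus logic generated by the tool suite: the classes and functions are hypothetical, the slave is a small memory that acknowledges on the clock edge where it sees STB_I, and the data signals (DAT_I/DAT_O), which come from the Wishbone specification and are not listed in the table, are modeled as plain integers.

```python
class WishboneSlave:
    """Toy slave: a small memory that acknowledges each strobed transfer."""

    def __init__(self, size: int = 16):
        self.mem = [0] * size
        self.ack_o = 0
        self.dat_o = 0

    def clock(self, cyc_i, stb_i, we_i, adr_i, dat_i):
        """Advance one clock edge and return (ACK_O, DAT_O)."""
        if cyc_i and stb_i and not self.ack_o:
            if we_i:
                self.mem[adr_i] = dat_i       # WRITE cycle
            else:
                self.dat_o = self.mem[adr_i]  # READ cycle
            self.ack_o = 1                    # normal termination
        else:
            self.ack_o = 0
        return self.ack_o, self.dat_o


def single_cycle(slave, we, adr, dat=0, timeout=8):
    """Master side: assert CYC_O and STB_O, hold until ACK_I, then release."""
    for _ in range(timeout):
        ack_i, dat_i = slave.clock(cyc_i=1, stb_i=1, we_i=we, adr_i=adr, dat_i=dat)
        if ack_i:
            # Deassert CYC_O and STB_O to end the bus cycle.
            slave.clock(cyc_i=0, stb_i=0, we_i=0, adr_i=0, dat_i=0)
            return dat_i
    raise TimeoutError("no ACK_I from slave")


slave = WishboneSlave()
single_cycle(slave, we=1, adr=3, dat=0xAB)   # write 0xAB to address 3
value = single_cycle(slave, we=0, adr=3)     # read it back
print(hex(value))
```

Error and retry terminations (ERR, RTY), byte selects (SEL), and tags (TGA, TGC) are omitted to keep the sketch focused on the basic handshake.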


Task B: PR System Design Automation with DAPR++ Tool Suite

PR-Task Manager (PRTM)

• Utilizes hardware reuse (reconfiguration overlapping) and configuration prefetching
• PRTM tested on a JPEG codec architecture

[Figure: simplified PR-task scheduler flowchart. Inputs: modularized C/C++ application; application task flow graph (TFG); slave architecture PRR information. Step labels: Create Application PR Task Schedule; Map Tasks to PRRs According to PRR Size; Check for Task Hardware Reusability; Create Task Pre-configuration Schedule; Execute Application.]
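
The effect of hardware reuse and configuration prefetching on reconfiguration delay can be sketched with a small accounting model. This is an illustrative sketch only: the slides do not describe PRTM's scheduling policy, so a least-recently-used eviction rule is swapped in, and the function name, task names, and timing values are hypothetical.

```python
from collections import OrderedDict


def schedule(tasks, num_prrs, reconfig_ms, exec_ms, prefetch=True):
    """Return (reconfigurations, exposed_delay_ms) for a linear task sequence."""
    loaded = OrderedDict()   # modules currently in PRRs, least recently used first
    reconfigs = 0
    exposed = 0.0
    for i, task in enumerate(tasks):
        if task in loaded:
            loaded.move_to_end(task)      # hardware reuse: no reconfiguration
            continue
        if len(loaded) >= num_prrs:
            loaded.popitem(last=False)    # evict the least recently used module
        loaded[task] = None
        reconfigs += 1
        # With prefetching and a second PRR, the load overlaps the previous
        # task's execution; the very first load can never be hidden.
        hidden = exec_ms if (prefetch and i > 0 and num_prrs > 1) else 0.0
        exposed += max(0.0, reconfig_ms - hidden)
    return reconfigs, exposed


# Hypothetical task sequence and timings, for illustration only.
tasks = ["color", "dct", "quant", "dct", "quant", "entropy"]
print(schedule(tasks, num_prrs=2, reconfig_ms=10.0, exec_ms=4.0, prefetch=False))
# prints (4, 40.0)
print(schedule(tasks, num_prrs=2, reconfig_ms=10.0, exec_ms=4.0, prefetch=True))
# prints (4, 28.0)
```

In this example reuse avoids two of the six possible reconfigurations, and prefetching hides part of each remaining load behind the previous task's execution.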

Task B: PR System Design Automation with DAPR++ Tool Suite

JPEG Codec Encoder Architecture

[Figure: encoder data path. Block labels: RGB2YCbCr & FDCT2D; ZigZag; Quantizer; Huffman Encoder; Run Length Encoder; Byte Stuffer; Header Generator; Encoder Pipeline Controller; Decoder Pipeline Controller; DEMUX; MUX; Buffer; RAM; Host IF (HOST PROG, HOST DATA); PR Region; PR Module. Connections are labeled "Data" and "Control Signals".]

Task B: PR System Design Automation with DAPR++ Tool Suite

JPEG Codec Decoder Architecture

[Figure: decoder data path. Block labels: YCbCr2RGB & IDCT2D; Reorder; Dequantizer; Huffman Decoder; Run Length Decoder; Byte Stripper; Header Decoder; Byte Stuffer; Header Generator; Decoder Pipeline Controller; DEMUX; MUX; Buffer; RAM; Host IF (HOST PROG, HOST DATA); PR Region; PR Module. Connections are labeled "Data" and "Control Signals".]

Task B: PR System Design Automation with DAPR++ Tool Suite

JPEG Codec

• Tests reveal PRTM-based systems achieve an average 40% reconfiguration delay reduction

[Figure: chart annotations: 20% reduction; 60% reduction.]

PRR – Partially Reconfigurable Region

Networking Tool Overview

• Sets up master and slave FPGA network interfaces
• Automatically generates hardware/software controllers
  • Master FPGA GPP to slave FPGA controller
  • Slave FPGA hardware controller for PRRs

[Figure: simplified master FPGA architecture (Amber processors 0, 1, and 2; Wishbone interface; Ethernet) and simplified slave FPGA architecture (Ethernet; Wishbone interface; HW controller; PRRs).]

Results: Resource Requirements

Component                                Slices   RAMB16   Max operating frequency
Slave FPGA PRRs                          1,600    2        250 MHz
100 Mbps Ethernet core                   1,100    4        180 MHz
Master FPGA processor with 8KB cache     4,820    10       250 MHz
Slave FPGA overall system with 2 PRRs    6,956    20       155 MHz
Master FPGA system with two GPPs         11,240   28       133 MHz

Networking Tool Experimental Setup

• Master/slave FPGA setup with 2 GPPs/PRRs
• Simple transmission test
  • Amber processors send data to PRRs
  • PRRs rotate the bit values and transfer back the result
• FFT, CORDIC, and matrix multiply cores
  • GPPs send data from master FPGA RAM to PRRs
  • PRRs process the data and transfer back the result

Results: Network Transfers

• Tested cores achieved an average throughput of 41 Mbps


Conclusions and Future Work

• We presented the DAPR and DAPR++ design flows
• DAPR performs automatic design space exploration
  • Uses an iterative candidate PR floorplan generation methodology
• DAPR design flow's key contributions include:
  • Making PR design more accessible and amenable to a wide range of system designers
  • Creating high-performance systems with reduced design time effort
  • Allowing designers to choose between PR designs that trade off clock frequency and partial bitstream size
• DAPR++ tool suite allows automated RC system generation
  • Each tool generates different RC system portions tailored to application needs
  • Portions are integrated to build a complete RC system ready for use on selected FPGAs
• Future work
  • Enhance portability of the DAPR++ tool suite to multiple devices and vendors

QUESTIONS?

This work was supported in part by the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422. We also gratefully acknowledge tools provided by Xilinx.
