Graduate Computer Architecture I

Lecture 16: FPGA Design

CSE/ESE 560M – Graduate Computer Architecture I

Emergence of FPGA

• Great for Prototyping and Testing
  – Enables logic verification without the high cost of fab
  – Reprogrammable
• Research and Education
  – Meets most computational requirements
  – Options for transferring design to ASIC
• Technology Advances
  – Huge FPGAs are available
    • Up to 200,000 Logic Units
  – Clock rates above 500 MHz
• Competitive Pricing


System on Chip (SoC)

• Large Embedded Memories
  – Up to 10 Megabits of on-chip memory (Virtex 4)
  – High bandwidth and reconfigurable
• Processor IP Cores
  – Many Soft Processor Cores (some open source)
  – Embedded Processor Cores
    • PowerPC, Nios RISC, etc. (450+ MHz)
  – Simple Digital Signal Processing Cores
    • Up to 512 DSPs on Virtex 4
• Interconnects
  – High speed network I/O (10 Gbps)
  – Built-in Ethernet MACs (Soft/Hard Core)
• Security
  – Embedded 256-bit AES Encryption


Potential Advantages of FPGAs


Designing with FPGAs

• Opportunities
  – Hardware logic is programmable
  – Immediate testing on the actual platform
• Challenges
  – Programming Environment
    • Think and design in 2-D instead of 1-D
    • Consider hardware limitations
  – Hardware Synthesis
    • Smart language interpreter and translator
    • Efficient HW resource utilization


Today

• Programming Environment
  – Object Oriented Programming Model
  – Template based language editors
  – Hardware/Software Co-design
  – Still a disconnect between SW/HW methods
  – Lack of education to bring them together
• Hardware Synthesis
  – Getting smarter but not smart enough
  – Tuned specifically for each platform
  – Not able to take full advantage of resources
  – Manual tweaking and using templates


High Performance Design in FPGA

• Fine Grain Pipelining
  – Reducing Critical Path
  – One level of look-up table between D flip-flops
  – Works best for streaming data with little or no data dependencies
• Logic Resource
  – Smaller designs often run faster
  – Use all available resources
  – Fewer resource map and place conflicts
  – Quicker compilation
• Parallel Engines
  – Exploit parallelism in the application
  – Faster place and route


Pipelining

• DEFINITION:
  – A K-Stage Pipeline (“K-pipeline”) is an acyclic circuit having exactly K registers on every path from an input to an output.
  – A COMBINATIONAL CIRCUIT is thus a 0-stage pipeline.
• CONVENTION:
  – Every pipeline stage, hence every K-Stage pipeline, has a register on its OUTPUT (not on its input).
• ALWAYS:
  – The CLOCK common to all registers must have a period sufficient to cover propagation over combinational paths + (input) register propagation delay + (output) register setup time.
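The ALWAYS rule can be written as a single inequality. The symbol names are labels chosen here for illustration, not notation from the slides: t_CLK is the clock period, t_PD,reg the propagation delay of the launching (input) register, t_PD,logic the worst-case combinational path delay between registers, and t_setup the setup time of the capturing (output) register.

```latex
t_{\mathrm{CLK}} \;\ge\; t_{\mathrm{PD,reg}} + t_{\mathrm{PD,logic}} + t_{\mathrm{setup}}
```

Pipelining shortens only the middle term, so the two register terms set a floor on the achievable clock period.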


Bad pipelining

• You cannot just add registers at random
  – Successive inputs get mixed: e.g., B(A(X_{i+1}), Y_i)
  – This happens because some paths from inputs to outputs have 2 registers, and some have only 1!
• Not a well-formed K-pipeline!
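A minimal VHDL sketch of the usual repair: add a balancing register so that every input-to-output path carries the same number of registers. The entity name, the 8-bit width, and the stand-in functions (A as an increment, B as a bitwise xor) are assumptions made for illustration; the slide does not specify the circuit in this form.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: computes B(A(X), Y) as a well-formed 2-stage pipeline.
entity balanced_pipe is
  port (
    clk : in  std_logic;
    x   : in  std_logic_vector(7 downto 0);
    y   : in  std_logic_vector(7 downto 0);
    z   : out std_logic_vector(7 downto 0)
  );
end entity balanced_pipe;

architecture rtl of balanced_pipe is
  signal a_reg : std_logic_vector(7 downto 0);  -- stage 1 register on A(X)
  signal y_reg : std_logic_vector(7 downto 0);  -- stage 1 balancing register on Y
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: A is a stand-in function (increment by one).
      a_reg <= std_logic_vector(unsigned(x) + 1);
      -- Without this register, stage 2 would combine A(X_i) with Y_{i+1}.
      y_reg <= y;
      -- Stage 2: B is a stand-in function (bitwise xor), registered on its output.
      z <= a_reg xor y_reg;
    end if;
  end process;
end architecture rtl;
```

Both the x-to-z and y-to-z paths now cross exactly two registers, which is the K = 2 case of the definition.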


Adding Pipelines

• Method
  – Draw a line that crosses every output in the circuit and mark the endpoints as terminal points.
  – Continue to draw new lines between the terminal points across various circuit connections, ensuring that every connection crosses each line in the same direction.
• These lines represent pipeline stages.
• Adding a pipeline register at every point where a separating line crosses a connection will always generate a valid pipeline.
• Focus on the slowest part of the circuit.


Pipelining Example

• 8-bit to 256-bit decoder
  – 256 different combinations

library ieee;
use ieee.std_logic_1164.all;

entity DECODER is
  port (
    I : in  std_logic_vector(7 downto 0);
    O : out std_logic_vector(255 downto 0)  -- 256 bits
  );
end DECODER;

architecture behavioral of DECODER is
begin
  process (I)
  begin
    case I is
      when "00000000" => O <= "1000...0000";
      when "00000001" => O <= "0100...0000";
      when "00000010" => O <= "0010...0000";
      ...
      when "11111110" => O <= "0000...0010";
      when "11111111" => O <= "0000...0001";
      when others     => O <= (others => '0');  -- std_logic metavalues such as 'U' or 'X'
    end case;
  end process;
end behavioral;
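For reference, the same behavior can be written without listing all 256 cases. This is a sketch rather than the lecture's code: it assumes ieee.numeric_std is available, and the entity name decoder_compact is made up here. The index is mirrored so that input 0 sets the leftmost bit, O(255), as the literals in the case statement do.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical compact form of the 8-to-256 decoder.
entity decoder_compact is
  port (
    I : in  std_logic_vector(7 downto 0);
    O : out std_logic_vector(255 downto 0)
  );
end entity decoder_compact;

architecture behavioral of decoder_compact is
begin
  process (I)
  begin
    -- Default every output to '0'; the later assignment overrides one bit.
    O <= (others => '0');
    -- Input value n sets bit 255 - n, matching the slide's literals.
    O(255 - to_integer(unsigned(I))) <= '1';
  end process;
end architecture behavioral;
```

Both forms describe the same function, so the synthesized logic should be comparable, although the exact mapping depends on the tool.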


Hardware Synthesis

• Synthesis
  – Uses at least three 4-input look-up tables (LUT4) per output to decode the 256 combinations of I(7:0)
• Resource Usage
  – 3 LUT4 x 256 outputs
  – 768 LUT4
• Critical Path
  – Input/Output pin delays
  – 2 levels of LUT4
  – Sometimes 3 levels?!
  – Virtex 4, speed grade 11
    • 8.281 ns (121 MHz)

[Figure: I(7:0) feeds combinational logic blocks for codes “0” through “255”, each built from LUT4s, driving outputs O(0) through O(255)]


Pipelined Decoder

• Input/Output pin DFF
  – Already in most FPGAs
  – Minimizes pin latencies
• DFF after every LUT4
  – LUT4 always followed by DFF (why not use it)
  – Only when possible
  – Minimizes logic latency
• FPGA Resource
  – 768 LUT4 as before
  – Plus 768 DFF and 264 pin DFF
  – But not really…
• Critical Path
  – 1 Level of LUT4
  – Plus small DFF propagation delay and setup
  – Virtex 4, speed grade 11
    • 2.198 ns (455 MHz)
    • 3.76x Speedup

[Figure: pipelined decoder. I(7:0) feeds combinational logic blocks for codes “0” through “255”, each built from LUT4s, driving outputs O(0) through O(255)]
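The clock rates and speedup quoted for the two decoders follow from f = 1/T. The subscripts "comb" and "pipe" are labels introduced here for the unpipelined and pipelined critical paths.

```latex
f_{\mathrm{comb}} = \frac{1}{8.281\ \mathrm{ns}} \approx 120.8\ \mathrm{MHz}, \qquad
f_{\mathrm{pipe}} = \frac{1}{2.198\ \mathrm{ns}} \approx 455.0\ \mathrm{MHz}

\text{speedup} = \frac{T_{\mathrm{comb}}}{T_{\mathrm{pipe}}} = \frac{8.281}{2.198} \approx 3.77
\qquad \left(\text{or } \tfrac{455}{121} \approx 3.76 \text{ from the rounded clock rates}\right)
```

This is a gain in clock rate, and therefore in throughput for streaming inputs; the latency of a single decode grows by the added register stages.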

Logic Resource

• Leveraging the FPGA Architecture
  – Design for similarity with the device architecture
  – LUT and a few special logic elements followed by DFF
• Smaller Design is often Faster
  – Easier for tools to Map, Place, and Route
  – Optimize designs wherever possible
  – In an FPGA, each wire has a large fanout limit
  – Reuse logic and results

[Figure: a wire from an input to an output. Fanout: the capacity of a wire to drive the inputs of other logic]


Reusing Logic

• Synthesis Tools
  – Obvious duplicate logic is automatically combined
  – Most is not optimized
• Decoder Example
  – Two 4-bit to 16-bit decoders
  – Combining decoder outputs
  – Two 16 bits to 256 bits
• Critical Path
  – 1 Level of LUT4
  – Approximately the same
  – Differences in wire delay
• FPGA Resources
  – I/O DFF remain same
  – 2 x 16 LUT4 and DFF
  – Plus 256 LUT4 and DFF
  – Total 288 LUT4 and DFF!

[Figure: I(7:0) feeds two sets of 4-to-16 decoders built from LUT4s; AND gates “0,0” through “15,15” combine one line from each decoder to drive O(0) through O(255)]

Virtex 4 – Elementary Logic Block

[Figure: Virtex 4 elementary logic block, with callouts for the 2-to-1 multiplexers, the 1-bit D flip-flops, and the 4-input LUTs]


Using MUXF as 2-input Gates

[Figure: two MUXF configurations wired as 2-input gates, with inputs a and b, a constant 0 on one data input, select line sel, and output z]

Inverters can be pushed into the LUT4 or DFF (by using inverted Q)
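A behavioral sketch of the idea: a 2-to-1 multiplexer with one data input tied to constant '0' acts as a 2-input AND gate. The entity name mux_and2 is hypothetical, and whether a synthesis tool maps this description onto a MUXF resource rather than a LUT is tool dependent; forcing that mapping would need a vendor primitive instantiation, which is not shown here.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch: a 2-to-1 multiplexer wired as a 2-input AND gate.
entity mux_and2 is
  port (
    a : in  std_logic;
    b : in  std_logic;
    z : out std_logic
  );
end entity mux_and2;

architecture rtl of mux_and2 is
begin
  -- b drives the select line; data input 1 carries a, data input 0 is tied to '0'.
  -- b = '1' selects a, any other value of b selects '0', so z = a and b.
  z <= a when b = '1' else '0';
end architecture rtl;
```

Swapping the constant to '1' and moving it to data input 1 gives a 2-input OR gate by the same reasoning.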


Using Unused Multiplexors

• Decoder Example
  – Replace all LUT4 in the 2nd decoder stage with MUX-based 2-input AND gates
• Critical Path
  – Same
  – 2.198 ns (455 MHz)
• FPGA Resources
  – I/O DFF remain same
  – 256 MUXF and DFF
  – 32 LUT4 and DFF

[Figure: I(7:0) feeds two sets of 4-to-16 decoders built from LUT4s; MUXF-based AND gates “0,0” through “15,15” combine one line from each decoder to drive O(0) through O(255)]


Parallel Design

• Use Area to Increase Performance
  – Increase the input bandwidth (input bus width)
    • Processing multiple data at a time
  – Duplicate engines to process independent data sets
    • Thread/Object level parallelism
    • Instruction level parallelism
  – Loop unroll to expose the parallelism
  – Excellent for Streaming Data Applications
    • Multimedia
    • Network Processing
• Performance Scalability
  – Linear performance increase with size
    • Achieved for many algorithms
  – Sometimes exponential hardware size
    • Try to scale using a higher level of parallelism
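Duplicating engines across a wider input bus might look like the following VHDL sketch. Everything named here is an assumption for illustration: the entity parallel_engines, the generics N_ENGINES and LANE_W, and the stand-in engine (a registered increment) are not from the lecture.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch: identical engines, each working on its own lane of a wide bus.
entity parallel_engines is
  generic (
    N_ENGINES : positive := 4;  -- assumed engine count
    LANE_W    : positive := 8   -- assumed lane width in bits
  );
  port (
    clk  : in  std_logic;
    din  : in  std_logic_vector(N_ENGINES * LANE_W - 1 downto 0);
    dout : out std_logic_vector(N_ENGINES * LANE_W - 1 downto 0)
  );
end entity parallel_engines;

architecture rtl of parallel_engines is
begin
  -- One copy of the engine per lane; the lanes share no data.
  gen_engines : for e in 0 to N_ENGINES - 1 generate
    process (clk)
    begin
      if rising_edge(clk) then
        -- Stand-in engine: increment the lane. A real engine would go here.
        dout((e + 1) * LANE_W - 1 downto e * LANE_W) <=
          std_logic_vector(unsigned(din((e + 1) * LANE_W - 1 downto e * LANE_W)) + 1);
      end if;
    end process;
  end generate gen_engines;
end architecture rtl;
```

Because the lanes are independent, raising N_ENGINES widens the bus and increases the data processed per clock in proportion, at a proportional cost in area.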


Summary

• FPGA Designing Methods
  – Fine Grain Pipelining to Increase Clock Rate
    • If possible, 1 level of LUT followed by DFF
  – Parallel Engines to Increase Bandwidth
    • Duplicate logic to linearly increase the performance
  – Reducing Logic Resource Usage
    • Reusing duplicate logic
    • Using all available embedded logic
    • There are other logic resources (e.g., embedded processors, large memories, optimized primitive gates, and IP cores)
• Best Methods Today
  – Learn about the internal architecture of the FPGA
  – Make your own templates and use them
  – Use IP Cores
• Future Research Topics
  – Integration of Generalized Pipelining Algorithms (in the works)
  – Smarter Synthesis Tools (Understanding HDL)
  – Automatic Platform Specific Optimization Techniques
